ABSTRACT

A method was developed for the construction of probabilistic
state-space models for non-repairable systems. This method allows
the construction of system models with considerably fewer states
than the models resulting from more traditional approaches. Models
were developed for several systems which achieved reliability improvement
by means of error coding, modularized sparing, massive replication,
and other fault-tolerant techniques.

From the models developed, sets of reliability and coverage 
equations for the systems were developed. Comparative analyses of the 
systems were performed using these equation sets. In addition, the 
effects of varying subunit reliabilities on system reliability and 
coverage were described. The results of these analyses indicated 
that a significant gain in system reliability may be achieved by use of 
combinations of modularized sparing, error coding and software error 
control. For sufficiently reliable system subunits, this gain may far 
exceed the reliability gain achieved by use of massive replication 
techniques, yet result in a considerable saving in system cost.

I. INTRODUCTION

As the field of computing system design has developed, the need 
for reliable computers has become crucial. Advances in the aerospace 
area in particular have necessitated the design of computing systems that 
are highly reliable and capable of operation in a non-repairable environ- 
ment. In many other system applications, while repair may be possible, 
an interruption in system operation is unacceptable. 

Due to the large number of components which it contains, the main 
memory has typically been the most unreliable subunit of the computing 
system [1]. Since this subunit contributes a high percentage of total
system size and weight and many systems must operate within limitations 
in these areas, massive replication techniques for memory reliability 
improvement are often not applicable. Thus, much research has been
performed to find methods of memory reliability improvement by other means.

Several methods of improvement have been utilized. One such
method is the development of error-control codes for use in the memory
array. Also, modular memory organizations have been designed in an
attempt to limit the possible ways that stored-word errors can occur
and to ease system reconfiguration problems. The example systems of
this paper utilize both coding and modular design for improved system
reliability. These systems are described in Chapter II.

A method is presented in this paper for calculating the reliability
and coverage of these systems. This method allows the construction of
system state diagrams with fewer states than occur in many state-space
approaches. The method used is described in Chapter III with example
results shown in Chapter V.

Background

Replication on the memory system level [2, 3] has been used as a
solution for the ultra-reliable memory problem. A substantial increase
in memory reliability has resulted in many cases. System cost, however,
has increased linearly with the number of duplicated systems. Other
limiting factors, such as system weight and size, have prevented the use
of massive replication techniques in many applications.

A number of proposed and actual systems [1, 4, 5, 6, 7, 8] have 
utilized a modular concept of memory arrangement, usually in conjunction 
with error coding. In addition, a number [9, 10, 11, 12] of burst-error
correcting codes have been developed. These codes are well suited for use 
in word-slice oriented memories in which a majority of the word errors 
may be expected to occur within groups of word bits. 

Several articles [13, 14, 15] have developed reliability calculation
procedures for the fault-tolerant memory problem. Many others [16, 17, 18]
have shown calculation procedures for fault-tolerant systems in general.
When a state-space approach to system modeling has been taken, the time
allowed for state transitions to occur is generally $\lim_{\Delta t \to 0} \Delta t$.
Typically, only one system event is allowed to occur in this transition interval.
Multiple states are then necessary to represent all possible combinations
of conditions of system subunits, resulting in large numbers of states for
highly complex systems.


II. FAULT TOLERANT MEMORY DESCRIPTION 

In this chapter, several fault-tolerant memory systems are
described. The first section describes a system which is taken as 
a basis for the comparison of related systems. Several related systems 
are described in the second section. Reliability and coverage 
computations for these systems will be examined in following chapters. 

Basic System 

The basic computer system to be analyzed has been designed for use
in extended aerospace missions. It was desirable to implement the
computer memory in a manner so as to be within weight, size, and
economic limitations, yet be highly fault-tolerant.

A modular design approach has been undertaken in which the memory 
array is made up of memory slices, each of which contains the same bit 
location of all memory words. If n words are contained in the memory 
and each word is k bits long, then there must be k memory modules and 
each module must contain n bits. These modules will be referred to as 
on-line bit planes.
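As a rough illustration of this organization, the following Python sketch slices a list of k-bit words into k bit planes and shows that the loss of one whole plane disturbs exactly one bit of every word. The function names and the small 4-bit word size are illustrative assumptions and are not part of the system described here.

```python
def to_bit_planes(words, k):
    """Return k planes; planes[b][w] is bit b of word w."""
    return [[(word >> b) & 1 for word in words] for b in range(k)]


def from_bit_planes(planes):
    """Reassemble the memory words from their bit planes."""
    n = len(planes[0])
    return [sum(planes[b][w] << b for b in range(len(planes))) for w in range(n)]


words = [0b1011, 0b0110, 0b1111]      # n = 3 words, k = 4 bits each
planes = to_bit_planes(words, 4)      # k = 4 planes of n = 3 bits each

# Simulate the failure of bit plane 2 by inverting every bit it holds.
damaged_planes = [list(plane) for plane in planes]
damaged_planes[2] = [bit ^ 1 for bit in damaged_planes[2]]
damaged_words = from_bit_planes(damaged_planes)

# Count the bit positions in which each word differs from the original.
flipped = [bin(a ^ b).count("1") for a, b in zip(words, damaged_words)]
```

Because a failed plane contributes at most one erroneous bit to any word, a single-error-correcting code applied to each word can mask the failure of an entire plane.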

In addition to the bit planes already discussed, the system contains 
identically-sized spare bit planes which may be switched to replace any 
failed on-line bit plane. The arrangement of on-line and spare bit 
planes is shown in Figure 1. The functional orientation of memory words
is shown in Figure 2.

A single-error-correcting/double-error-detecting (22, 16) code [19]
is used for memory data word encoding. This code has the property that
any odd number of errors in a codeword will produce an odd-weighted
error syndrome. Double errors will produce a non-zero, even-weighted error
syndrome and higher numbers of even errors will produce even-weighted
(including all zero) error syndromes. These features of the code
will be further discussed in a later section.
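The syndrome-weight property can be summarized in a few lines of Python. This is only a sketch of the classification rule stated above; the function name and the string labels are hypothetical, and the computation of the syndrome itself from the (22, 16) parity-check matrix is not shown.

```python
def classify_syndrome(syndrome_bits):
    """Classify an error syndrome by its weight (number of 1 bits)."""
    weight = sum(syndrome_bits)
    if weight == 0:
        # No error, or an even number (four or more) of errors that
        # happens to produce the all-zero syndrome.
        return "zero"
    if weight % 2 == 1:
        # An odd number of errors; treated as a single, correctable error.
        return "odd"
    # A double error, or a higher even number of errors: detected only.
    return "even"


# The (22, 16) code has 22 - 16 = 6 check bits, so a syndrome has 6 bits.
labels = [
    classify_syndrome([0, 0, 0, 0, 0, 0]),
    classify_syndrome([1, 0, 1, 1, 0, 0]),
    classify_syndrome([1, 0, 0, 1, 0, 0]),
]
```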

External to the memory, data words are encoded using only 2 byte
parity bits. For this reason, circuitry which translates between the two
codes is necessary for use in memory write and read cycles. This
function is performed by the memory translator. In addition, the translator
contains circuitry for the correction of single bit errors and detection
of multiple bit errors in memory words, and control of the reconfiguration
switching circuitry which directs each word bit to the appropriate
bit plane. These functions will now be examined.

For a memory write operation, the translator accepts a byte-parity 
encoded word from the CPU-memory bus. The byte parity bits are saved 
and the check bits for the SEC/DED code are generated. A validity 
check is then made by a comparison of the saved byte parity bits with 
the generated check bits. If no error is found, the data word with SEC/DED 
check bits appended is stored in the memory. If an error is found, a 
program interrupt is sent to the CPU. 

For a memory read operation, the requested encoded word is read
from the memory array and placed in the storage data register (SDR). The
error syndrome for the word is formed from the encoded word and if a zero
(no error) syndrome is signaled, the byte parity bits for the data word
are formed and the word is transmitted on the data bus.

An odd-weight (odd error) syndrome signal causes a bit inversion to 
be made by the single error correction circuitry. The error syndrome for 
the corrected word is then generated. If no error is signalled, then 
it is assumed that there was a single error in the encoded word. The 
byte parity bits are generated and the word is transmitted on the data 
bus. If an error is signaled, a program interrupt is generated. 

When the translator receives the information that a certain 
designated spare bit plane is to replace an on-line bit plane, it must 
reconfigure the memory array input and output switching to reflect this 
change. Memory input switching is reconfigured first. Each memory word 
is then read from the on-line array, corrected if necessary, and 
re-written in the on-line array with the spare bit plane replacing the 
designated on-line bit plane. After all memory words have been read and 
restored, the memory array output switching is reconfigured appropriately.

The decision to replace an on-line bit plane may be arrived at by 
use of various switching strategies. It is assumed for the basic system
that the reconfiguration signal is issued by the CPU as a result of 
error signals received from the translator. It is also assumed that the 
switching strategy is to replace a bit plane as soon as it is detected 
that the bit plane contains an error. Another switching strategy will 
be discussed in a following section. 

In the basic system, there is assumed to be no facility available
for the correction of multiple errors. If system failure is defined to
be the occurrence of a non-correctable error, then the occurrence of
more than one error in a single memory word will constitute failure for
this system. For purposes of system modeling, the occurrence of
simultaneous failures in multiple bit planes is assumed to be
equivalent to the occurrence of multiple errors in a single memory word.
System failure, then, will occur when more than one on-line bit plane
has failed.

Spare bit planes are assumed to operate in a mode identical to the
on-line bit planes prior to their insertion into the on-line array.
Spare bit planes, then, fail with the same characteristics as the
on-line units. It is also assumed that after a bit plane has been
removed from the on-line array, it is never re-inserted. A bit plane
which has been replaced is called an unavailable spare. A spare bit
plane which has not been inserted into the on-line array and which may
or may not be failed is an available spare.

The system, then, may be divided into subsystems by function. These 
subsystems are: 

1) The on-line memory array consisting of a number of bit 
planes, 

2) The spare bit plane array including both available and 
unavailable spares, 

3) The error detection circuitry of the translator, 

4) The error correction circuitry of the translator, 

5) The reconfiguration switching array, and 

6) The encoding and decoding subsystems of the translator.

References will be made to these subsystems in following sections.

Alternate Designs

Several fault-tolerant memory systems which are related to the 
basic system have been studied. Four of these systems will be described 
in this section. 

The non-spared system is identical to the basic system except that 
no spare bit planes are provided. In addition, no reconfiguration switch- 
ing circuitry is included, since such circuitry would have no use in this 
system. Comparisons made between this system and the basic system will 
show the relative improvement to be gained by the use of the spare bit 
plane approach. 

The TMR system consists of three systems of the non-spared type 
in a triple modular redundant configuration. The functional operation 
of this system may be described as follows: 

1) For a memory write operation, an SEC/DED-encoded word is
stored in the same logical location in three memories.

2) For a memory read operation, the requested memory location
is read in all three memories. Single error correction
is performed independently by the systems and byte parity
bits are generated in each case. The three byte-parity
encoded words are then voted on by majority logic in a
bit-by-bit fashion. The output word is constructed by
using the majority vote for each bit. If the constructed
word is still a codeword, it is transmitted on the data bus.
If it is not a codeword, an error program interrupt is
generated.

This system, then, will produce the correct output word as long as 
at least two of the three memories can construct the correct word. A 
functional depiction of this system is shown in Figure 3.
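The bit-by-bit majority vote used in the read operation can be sketched as follows. This is a minimal illustration in Python, assuming each of the three words is held as an integer; the function name and the sample word are hypothetical, and the subsequent codeword check is not shown.

```python
def majority_vote(a, b, c):
    """Bitwise majority of three words: each output bit is 1 when at
    least two of the three corresponding input bits are 1."""
    return (a & b) | (a & c) | (b & c)


correct = 0b1010110011110000
corrupted = correct ^ 0b0000000100000000   # one memory returns a wrong bit

voted = majority_vote(correct, corrupted, correct)
```

The vote recovers the correct word whenever, in every bit position, at least two of the three memories agree on the correct value.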

The duplicated system is composed of two identical non-spared
subsystems. Data to be loaded is stored in the same logical location in
both subsystems. Data read from the system is read from only one memory.
If a non-correctable error is signalled by the on-line unit, output 
bussing is switched to the other unit and the data is read from the same 
location. If both subsystems signal a non-correctable error in the 
same memory word, an error program interrupt is generated. 

The double-error-correcting system is a modified version of the
basic system which will correct double errors and detect a triple error
which produces a single error syndrome. The additional features are
achieved by the use of software routines [20] which are CPU implemented.
Since double errors are correctable in this system, a reconfiguration
switching strategy is assumed in which an on-line bit plane is replaced
only if it contains an erroneous bit position of a word which has two
or more errors. This system will be more fully discussed in a later
chapter.


III. RELIABILITY MODEL DEVELOPMENT 


In this chapter, a generalized method for the computation of 
reliability, the probability of satisfactory operation, and coverage, 
the probability of recovery if a failure occurs, for a system is 
described. This method is applied to form sets of reliability and 
coverage equations for the basic system described in the preceding
chapter. Computer implementation of these equations is examined in the 
last section. 


General Techniques 

Prior to the development, it is appropriate that certain notation 
be defined. A listing of notation used is shown in Table 1. 

For the purpose of reliability computation, the performance of a 
device may often be represented as a set of states and state transitions.
Suppose, for example, that a certain non-repairable device has three
possible modes of operation:

1) Satisfactory operation,

2) Degraded operation caused by event A which occurred while
the device was operating satisfactorily, and 

3) Unsatisfactory operation caused by event B which occurred 
while the device was operating satisfactorily or by event 
C which occurred while the device was operating in its 
degraded mode. 

These three modes of operation form three natural states for the device.

TABLE 1. Definition of Notation

Notation                        Meaning

$P(x)$                          Probability of the occurrence of event x

$P(x,r)$                        Probability of the occurrence of events x and r

$P(x \text{ or } r)$            Probability of the occurrence of event x or event r
                                or both

$P(x/r)$                        Probability of the occurrence of event x given that
                                event r has occurred

$P_x(t,\Delta t)$               Probability of the occurrence of event x in the time
                                period from t to $t + \Delta t$

$P_{i,j}(t,\Delta t/i)$         Probability of the occurrence of a transition from state
or $P_{i,j}(t,\Delta t)$        i to state j in the time period from t to $t + \Delta t$ given
                                that the state at time t is i

$P_i(t)$                        Probability that the system is in state i at time t

$P_i(t + \Delta t, j)$          Probability that the system is in state i at time $t + \Delta t$
                                and that it was in state j at time t

$P_i(t + \Delta t/a)$           Probability that the system is in state i at time $t + \Delta t$
                                given that it was in state a at time t

$r_j(t)$                        Probability that component j is non-failed at time t

$r(t)$                          Probability that a generalized component is non-failed
                                at time t

If the assumption is made that the system is operating in state 1
(satisfactory operation) at time t, then the probability that the system
will be in state 2 (degraded operation) at time $t + \Delta t$, a small interval
of time later, is the probability that event A occurred in the time
period from t to $t + \Delta t$. In equation form,

$$P_{1,2}(t,\Delta t/1) = P_A(t,\Delta t)$$

where $P_{1,2}(t,\Delta t/1)$ is the probability of a transition from state 1 to
state 2 in time t to $t + \Delta t$ given that the system was in state 1 at
time t and $P_A(t,\Delta t)$ is the probability of the occurrence of event A
in the same time period.

In a similar manner, the transition probabilities into state 3
(the failed state) are

$$P_{1,3}(t,\Delta t/1) = P_B(t,\Delta t), \text{ and}$$
$$P_{2,3}(t,\Delta t/2) = P_C(t,\Delta t).$$

This state model can be described graphically by a state diagram as shown
in Figure 4.


An equivalent form of device state representation is a matrix T
which has as its i,j entry $P_{i,j}(t,\Delta t/i)$ for $i \neq j$ and

$$1 - \sum_{k=1,\, k \neq i}^{N} T[i,k]$$

for i = j, where N is the number of device states. The T matrix for the
example device is given below, with rows and columns indexed by states
1, 2, and 3.

$$T = \begin{bmatrix}
1 - P_A(t,\Delta t) - P_B(t,\Delta t) & P_A(t,\Delta t) & P_B(t,\Delta t) \\
0 & 1 - P_C(t,\Delta t) & P_C(t,\Delta t) \\
0 & 0 & 1
\end{bmatrix}$$

FIGURE 4. State Diagram and Transition Probabilities for Example Device

Deleting the $(t,\Delta t)$ arguments yields

$$T = \begin{bmatrix}
1 - P_A - P_B & P_A & P_B \\
0 & 1 - P_C & P_C \\
0 & 0 & 1
\end{bmatrix}.$$

The probability that the system is in any given state at time
$t + \Delta t$ may be expressed in terms of the transition probabilities and
the distribution of state probabilities at time t. These equations may
be obtained by assuming that the system is operating in a state i at time t
and by computing the probability of the occurrence of the transition
event to state j in the time period from t to $t + \Delta t$.

For the development, the following notational convention will be
used.

P(system operating in state i at $t + \Delta t$ given that the
system was in state j at time t) = $P_i(t + \Delta t/j)$.

To obtain the equation for $P_1(t + \Delta t/1)$, it must be considered
that for the system to be in state 1 at time $t + \Delta t$, no state transition
out of state 1 may occur between t and $t + \Delta t$. Then the complements of
the two state transitions out of state 1 must be combined as follows:

$$\begin{aligned}
P_1(t + \Delta t/1) &= (1 - P_{1,2}(t,\Delta t/1))(1 - P_{1,3}(t,\Delta t/1)) \\
&= 1 - P_{1,2}(t,\Delta t/1) - P_{1,3}(t,\Delta t/1) + P_{1,2}(t,\Delta t/1)\,P_{1,3}(t,\Delta t/1).
\end{aligned}$$

If $\Delta t$ is defined to be a period of time which is small enough to allow
only one state transition to take place, the last term in this equation
becomes negligible since it defines the probability of more than one state
transition occurring in time t to $t + \Delta t$. Then,

$$\begin{aligned}
P_1(t + \Delta t/1) &= 1 - P_{1,2}(t,\Delta t/1) - P_{1,3}(t,\Delta t/1) \\
&= 1 - P_A(t,\Delta t) - P_B(t,\Delta t).
\end{aligned}$$

Recalling that

$$P(X/Y) = \frac{P(X,Y)}{P(Y)} \quad [25],$$

then

$$P_1(t + \Delta t/1) = \frac{P_1(t + \Delta t,1)}{P_1(t)}$$

and

$$P_1(t + \Delta t,1) = P_1(t)\,(1 - P_A(t,\Delta t) - P_B(t,\Delta t)).$$

Since there are no transition paths into state 1, the event "the
system is in state 1 at $t + \Delta t$" implies the event "the system is in state
1 at time t." Then,

$$P_1(t + \Delta t,1) = P_1(t + \Delta t).$$

So,

$$P_1(t + \Delta t) = P_1(t)\,(1 - P_A(t,\Delta t) - P_B(t,\Delta t)).$$

There are two ways for the system to be in state 2 at time $t + \Delta t$.
Either the system was operating in state 1 at time t and the transition
from state 1 to state 2 occurred in the time period from t to $t + \Delta t$,
or the system was operating in state 2 at time t and no transition event
out of state 2 occurred in t to $t + \Delta t$.

The equation for $P_2(t + \Delta t)$ may then be formed as follows:

$$\begin{aligned}
P_2(t + \Delta t) &= P_2(t + \Delta t,1) + P_2(t + \Delta t,2) \\
&= P_A(t,\Delta t)\,P_1(t) + (1 - P_C(t,\Delta t))\,P_2(t).
\end{aligned}$$

By similar reasoning,

$$\begin{aligned}
P_3(t + \Delta t) &= P_3(t + \Delta t,1) + P_3(t + \Delta t,2) + P_3(t + \Delta t,3) \\
&= P_{1,3}(t,\Delta t/1)\,P_1(t) + P_{2,3}(t,\Delta t/2)\,P_2(t) + P_3(t + \Delta t,3) \\
&= P_B(t,\Delta t)\,P_1(t) + P_C(t,\Delta t)\,P_2(t) + P_3(t + \Delta t,3).
\end{aligned}$$

Since there are no transition paths out of state 3, the probability that
the system is in state 3 at $t + \Delta t$ and that it was in state 3 at time t is
the probability of the latter condition, or

$$P_3(t + \Delta t,3) = P_3(t).$$

By substitution, the equation for $P_3(t + \Delta t)$ becomes

$$P_3(t + \Delta t) = P_B(t,\Delta t)\,P_1(t) + P_C(t,\Delta t)\,P_2(t) + P_3(t).$$


In general, the state probability equation for state i is

$$P_i(t + \Delta t) = \sum_{j=1,\, j \neq i}^{n} P_{j,i}(t,\Delta t/j)\,P_j(t) + \Bigl(1 - \sum_{k=1,\, k \neq i}^{n} P_{i,k}(t,\Delta t/i)\Bigr) P_i(t)$$

where n is the number of system states. The first summation in this
equation represents the sum of the probabilities of all possible
transitions into state i from another state. The coefficient of
$P_i(t)$ is the probability that no transition out of state i will occur
in t to $t + \Delta t$ given that the system was in state i at time t.

Since for each term of the form $P_{u,v}(t,\Delta t/u)$ the u inside the
parentheses is redundant, this probability may be represented as
$P_{u,v}(t,\Delta t)$ where the deleted u is understood.

The general state probability equation then becomes

$$P_i(t + \Delta t) = \sum_{j=1,\, j \neq i}^{n} P_{j,i}(t,\Delta t)\,P_j(t) + \Bigl(1 - \sum_{k=1,\, k \neq i}^{n} P_{i,k}(t,\Delta t)\Bigr) P_i(t). \tag{3-1}$$

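If vectors $\mathbf{P}(t + \Delta t)$ and $\mathbf{P}(t)$ are defined by

$$\mathbf{P}(t + \Delta t) = \begin{bmatrix} P_1(t + \Delta t) \\ P_2(t + \Delta t) \\ \vdots \\ P_n(t + \Delta t) \end{bmatrix},
\qquad
\mathbf{P}(t) = \begin{bmatrix} P_1(t) \\ P_2(t) \\ \vdots \\ P_n(t) \end{bmatrix},$$

then equation (3-1) may be represented in matrix form as

$$\mathbf{P}(t + \Delta t) = T^{T}\,\mathbf{P}(t) \tag{3-2}$$

where T is previously defined and $T^{T}$ is the transpose of T.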
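The matrix form lends itself to direct step-by-step evaluation. The sketch below applies it to the three-state example device in Python. The event rates, the step size, and the use of $1 - e^{-\lambda \Delta t}$ for each event probability are assumptions made only for this illustration; the function names are likewise hypothetical.

```python
import math


def transition_matrix(p_a, p_b, p_c):
    """T[i][j] is the probability of moving from state i+1 to state j+1
    during one interval of length dt (example device of Figure 4)."""
    return [
        [1.0 - p_a - p_b, p_a, p_b],
        [0.0, 1.0 - p_c, p_c],
        [0.0, 0.0, 1.0],
    ]


def step(T, p):
    """One application of P(t + dt) = T^T P(t)."""
    n = len(p)
    return [sum(T[j][i] * p[j] for j in range(n)) for i in range(n)]


# Assumed event rates (per hour) and step size (hours).
rate_a, rate_b, rate_c, dt = 1e-3, 2e-4, 5e-3, 0.1
T = transition_matrix(*(1.0 - math.exp(-rate * dt)
                        for rate in (rate_a, rate_b, rate_c)))

p = [1.0, 0.0, 0.0]            # the device starts in state 1
for _ in range(10_000):        # advance to t = 1000 hours
    p = step(T, p)

reliability = p[0] + p[1]      # probability of not being in the failed state
```

Each row of T sums to one, so the state probabilities continue to sum to one after every step.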

In a complex system, the events which cause state transitions may be
composed of many subevents which must occur for the transition event to
occur. It may be more desirable to work with these subevent probabilities
than to attempt to determine the probability of the overall event. For
this reason, it is necessary to analyze the possible types of subevents
and to be able to calculate the probability of occurrence of each type.

For any transition event i,j with probability of occurrence $P_{i,j}(t,\Delta t)$,
it is possible to place any subevent in exactly one of the following six
event classes:

1. The failure event of a system component or component group
prior to time $t + \Delta t$.

2. The non-failure event of a system component or component
group prior to time $t + \Delta t$.

3. The failure event of a system component or component group
in the time period from t to $t + \Delta t$.

4. The non-failure event of a system component or component
group in the time period from t to $t + \Delta t$.

5. The failure event of a system component or component
group in the time period from t to $t + \Delta t$ given its
non-failure prior to t.

6. The non-failure event of a system component or component
group in the time period from t to $t + \Delta t$ given its
non-failure prior to t.

In order to compute the probability of events in each of these
classes, it is necessary to first examine the basis for the computation
of failure probabilities.

Each system component or component group has associated with it a
failure probability density function, f(t). In the general case,

$$\int_0^{\infty} f(t)\,dt = 1.$$

The a priori probability of component (group) failure in the time period
from $t_1$ to $t_2$ may be expressed as

$$P_F = \int_{t_1}^{t_2} f(t)\,dt,$$

and the a priori reliability of the component (group) at time $t_3$ is

$$r(t_3) = 1 - \int_0^{t_3} f(t)\,dt = \int_{t_3}^{\infty} f(t)\,dt.$$

If the assumption is made that, at time $t_1 > 0$, a particular
component (group) is non-failed, then the probability of failure prior
to this time is 0 and the probability of failure after $t_1$ is 1. Then,

$$\int_{t_1}^{\infty} f'(t)\,dt = 1.$$

From [26], for f(t) exponential, $f'(t) = f(t - t_1)$. For this and following
developments, all failure density functions will be assumed to be of
the exponential type.

If the failure probability of this component in the time period from
$t_1$ to $t_1 + \Delta t$ is of concern, then

$$\int_{t_1}^{t_1 + \Delta t} f'(t)\,dt = \int_{t_1}^{t_1 + \Delta t} f(t - t_1)\,dt = \int_0^{\Delta t} f(t)\,dt,$$

which is the probability that the component (group) will fail in the
interval from t to $t + \Delta t$ given that it is non-failed prior to time t.
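For the exponential density assumed here, the shift property can be checked directly. In the worked form below, the symbol $\lambda$ for the constant failure rate is introduced only for illustration, and $f'(t)$ is interpreted as the density conditioned on non-failure at $t_1$; both are assumptions of this sketch rather than notation taken from the text.

```latex
\begin{aligned}
f(t) &= \lambda e^{-\lambda t}, \qquad
r(t) = \int_{t}^{\infty} f(\tau)\,d\tau = e^{-\lambda t}, \\
f'(t) &= \frac{f(t)}{r(t_1)} = \lambda e^{-\lambda (t - t_1)} = f(t - t_1),
\qquad t \ge t_1, \\
\int_{t_1}^{t_1 + \Delta t} f'(t)\,dt
&= \int_{0}^{\Delta t} \lambda e^{-\lambda t}\,dt
= 1 - e^{-\lambda \Delta t} = 1 - r(\Delta t).
\end{aligned}
```

The result depends only on the interval length $\Delta t$ and not on $t_1$, which is the memoryless property that the later subevent classes rely on.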

By use of these concepts, the subevent probabilities for each class
may now be computed as follows:

Class 1.
$$P_1 = \int_0^{t + \Delta t} f(t)\,dt = 1 - \int_{t + \Delta t}^{\infty} f(t)\,dt = 1 - r(t + \Delta t).$$

Class 2.
$$P_2 = 1 - \int_0^{t + \Delta t} f(t)\,dt = r(t + \Delta t).$$

Class 3.
$$P_3 = \int_t^{t + \Delta t} f(t)\,dt = \int_t^{\infty} f(t)\,dt - \int_{t + \Delta t}^{\infty} f(t)\,dt = r(t) - r(t + \Delta t).$$

Class 4.
$$P_4 = 1 - \int_t^{t + \Delta t} f(t)\,dt = 1 - r(t) + r(t + \Delta t).$$

Class 5.
$$P_5 = \int_0^{\Delta t} f(t)\,dt = 1 - r(\Delta t).$$

Class 6.
$$P_6 = 1 - \int_0^{\Delta t} f(t)\,dt = 1 - P_5 = r(\Delta t).$$
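The six class probabilities are straightforward to evaluate once a reliability function is chosen. The following Python sketch assumes the exponential form $r(t) = e^{-\lambda t}$; the function name, the rate, and the times are hypothetical values chosen only for illustration.

```python
import math


def subevent_probabilities(rate, t, dt):
    """Return the class 1 through class 6 subevent probabilities for a
    component with assumed reliability r(x) = exp(-rate * x)."""
    def r(x):
        return math.exp(-rate * x)

    return {
        1: 1.0 - r(t + dt),         # failure prior to t + dt
        2: r(t + dt),               # non-failure prior to t + dt
        3: r(t) - r(t + dt),        # failure between t and t + dt
        4: 1.0 - r(t) + r(t + dt),  # non-failure between t and t + dt
        5: 1.0 - r(dt),             # failure in the interval, given non-failure at t
        6: r(dt),                   # non-failure in the interval, given non-failure at t
    }


p = subevent_probabilities(rate=1e-4, t=500.0, dt=1.0)
```

Classes 1 and 2, 3 and 4, and 5 and 6 are complementary pairs, and dividing the class 3 probability by r(t) reproduces the class 5 probability for the exponential case.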

To completely specify the state probabilities, it is necessary to
select a base time, $t_{base}$. In general, $t_{base}$ may be any time at which
all state probabilities are known. The following discussion will
assume that $t_{base}$ is 0. It is common to denote one system state, m, as
the starting state and assume that

$$P_m(t = t_{base} = 0) = 1.$$

The state probabilities may be computed for any t > 0 if: 

1. All state transition equations are known, and 

2. All system component (group) reliability equations are 
known. 

To obtain a closed-form solution for each probability equation, 
it is common to rearrange each equation into its differential form and 
solve the equation set simultaneously. By making simplifying assumptions,
the equation set may be approximated by a set of linear differential
equations. For systems with a large number of states, however, the
simultaneous solution problem may become quite involved. In addition,
if the analysis of a related system is desired, only a slight difference
in architecture or operation may necessitate the re-derivation of all
state equations.

If computer evaluation of state probabilities is possible, however, 
the open form of the state probability equations may yield satisfactory 
results at considerable savings in effort. In addition, no simplifying 
assumptions need be made to assure equation linearity. State probability 
equations to be derived in this paper will remain in this open form. 

Reliability Equations 

For the basic memory system, the insertion of each spare bit plane 
on-line performs a natural partitioning of system states. By determining 
the number of available spares it is possible to define the state of 
the system. If the basic system has k bits per memory word and s spare 
bit planes initially available, the system state diagram may be constructed 
as shown in Figure 5. 

For each state i ($1 \le i \le s+1$) in this diagram, the system is
operating with exactly s - i + 1 spare bit planes available, and no
failed bits in any word (no failed bit planes on-line). In state s+2,
the system has suffered a single bit plane failure but there are no
available spare bit planes to replace the failed on-line bit plane.
The system must use the single-error-correction circuitry to correct
one error in each memory word in this state. The FAIL state is the
system state when an uncorrectable error has occurred.

The development of transition and state probability equations for
this system will now be shown.

For the transition event to occur from state i to state i+1
($1 \le i \le s$) in the time period from t to $t + \Delta t$, exactly four subevents
must occur. These subevents are:

$E_1$: The failure of exactly one on-line bit plane in the
time period from t to $t + \Delta t$ given the non-failure
of all on-line bit planes prior to t,

$E_2$: The non-failure of the system error detector (group)
prior to time $t + \Delta t$,

$E_3$: The non-failure of the system reconfiguration switching
circuitry (group) prior to time $t + \Delta t$, and

$E_4$: The non-failure of at least one available spare bit plane
prior to time $t + \Delta t$.

These subevents belong to classes 5, 2, 2, and 2, respectively.

The subevent probabilities may be computed as:

$$P_{E_1}(t,\Delta t) = k\,(r(\Delta t))^{k-1}(1 - r(\Delta t))$$
$$P_{E_2}(t,\Delta t) = r_d(t + \Delta t)$$
$$P_{E_3}(t,\Delta t) = r_s(t + \Delta t)$$
$$P_{E_4}(t,\Delta t) = 1 - (1 - r(t + \Delta t))^{s-i+1}$$

where all symbols are as defined in Tables 1 and 2.

TABLE 2. Definition of Reliability Symbols for Basic System

Symbol        Meaning

$r(t)$        Reliability of an on-line or available spare
              bit plane at time t.

$r_d(t)$      Reliability of the system error detector (group)
              at time t.

$r_c(t)$      Reliability of the system error corrector (group)
              at time t.

$r_s(t)$      Reliability of the system reconfiguration switching
              circuitry (group) at time t.

Assuming subevent independence, then

$$\begin{aligned}
P_{i,i+1}(t,\Delta t) &= P_{E_1} P_{E_2} P_{E_3} P_{E_4} \\
&= k\,(r(\Delta t))^{k-1}(1 - r(\Delta t)) \cdot r_d(t + \Delta t)\,r_s(t + \Delta t)\,\bigl(1 - (1 - r(t + \Delta t))^{s-i+1}\bigr)
\end{aligned}$$

for $1 \le i \le s$.

Denote $r_m(t)$ by $r_m$ and $r_m(t + \Delta t)$ by $r_m'$. Then

$$P_{i,i+1}(t,\Delta t) = k\,(r(\Delta t))^{k-1}(1 - r(\Delta t)) \cdot r_d'\,r_s'\,\bigl(1 - (1 - r')^{s-i+1}\bigr)$$

for $1 \le i \le s$.

The state transition event from state i to state s+2 ($1 \le i \le s+1$)
represents a transition of the system from a condition in which no
on-line bit planes are failed to a condition in which exactly one on-line
bit plane is failed and no non-failed spare is available for replacement.
The subevents composing this transition event are:

$$E_1,\ E_2,\ E_5,\ \text{and } [E_6 \text{ or } (E_4 \text{ and } E_7)]$$

where $E_1$, $E_2$, and $E_4$ are as previously defined and $E_5$, $E_6$, and $E_7$ are
as described below.

$E_5$: The non-failure of the system error corrector (group)
prior to time $t + \Delta t$.

$E_6$: The failure of all s - i + 1 available spare bit planes
prior to time $t + \Delta t$.

$E_7$: The failure of the system reconfiguration switching
circuitry (group) prior to time $t + \Delta t$.

The state transition probability may now be formed as

$$P_{i,s+2}(t,\Delta t) = P_{E_1} P_{E_2} P_{E_5}\,[P_{E_6} + P_{E_4} P_{E_7}]$$

which reduces to

$$P_{i,s+2}(t,\Delta t) = k\,(r(\Delta t))^{k-1}(1 - r(\Delta t))\,r_d'\,r_c' \cdot \bigl[1 - r_s'\bigl(1 - (1 - r')^{s-i+1}\bigr)\bigr]$$

for $1 \le i \le s+1$.
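The reduction of the bracketed term for the transition into state s+2 can be written out step by step. The shorthand $A$ below is introduced only for this illustration; it stands for the probability of subevent $E_4$, and the remaining probabilities follow from classes 1 and 2 as stated in the text.

```latex
\begin{aligned}
A &\equiv P_{E_4} = 1 - (1 - r')^{s-i+1}, \qquad
P_{E_6} = (1 - r')^{s-i+1} = 1 - A, \qquad
P_{E_7} = 1 - r_s', \\
P_{E_6} + P_{E_4} P_{E_7}
&= (1 - A) + A\,(1 - r_s') \\
&= 1 - r_s' A \\
&= 1 - r_s'\bigl(1 - (1 - r')^{s-i+1}\bigr).
\end{aligned}
```

The two alternatives may be added because $E_6$ (all spares failed) and $E_4$ (at least one spare non-failed) are mutually exclusive.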

Define for each state i in the system state diagram a probability

$$P_{i,i}(t,\Delta t)$$

which is the probability that, if the system is in state i at time t,
no transition out of state i will occur before time $t + \Delta t$. Then, for
the basic system,

$$P_{i,i}(t,\Delta t) + P_{i,i+1}(t,\Delta t) + P_{i,s+2}(t,\Delta t) + P_{i,FAIL}(t,\Delta t) = 1 \quad \text{for } 1 \le i \le s,$$
$$P_{s+1,s+1}(t,\Delta t) + P_{s+1,s+2}(t,\Delta t) + P_{s+1,FAIL}(t,\Delta t) = 1,$$
$$P_{s+2,s+2}(t,\Delta t) + P_{s+2,FAIL}(t,\Delta t) = 1, \text{ and}$$
$$P_{FAIL,FAIL}(t,\Delta t) = 1.$$

The formulation of the equation for $P_{i,i}(t,\Delta t)$, then, uniquely
specifies the equation for $P_{i,FAIL}(t,\Delta t)$. Since, for this system, the
non-transition event involves fewer subevents than the transition
event to the failed state, these non-transition equations will be developed.

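For states 1 through s+1, the only event occurrence which is
necessary for the non-transition event to occur in time t to $t + \Delta t$
is the non-failure of the k on-line bit planes in the same time interval
given that all were non-failed at time t. Then

$$P_{i,i}(t,\Delta t) = (r(\Delta t))^{k} \quad \text{for } 1 \le i \le s+1.$$

The operation of the system error detector and corrector is required
for the system to be in state s+2 at time t. The non-transition event
for this state, then, contains the subevents

$E_8$: The non-failure of the system error detector (group)
in the time period from t to $t + \Delta t$, given its
non-failure prior to t, and

$E_9$: The non-failure of the system error corrector (group)
in the time period from t to $t + \Delta t$, given its
non-failure prior to t.

In addition, none of the k-1 on-line operating bit planes may fail from
t to $t + \Delta t$. Then

$$P_{s+2,s+2}(t,\Delta t) = r_d(\Delta t)\,r_c(\Delta t)\,(r(\Delta t))^{k-1}.$$

For any state i ($1 \le i \le s$), $P_{i,FAIL}(t,\Delta t)$ may be computed as

$$\begin{aligned}
P_{i,FAIL}(t,\Delta t) &= 1 - P_{i,i}(t,\Delta t) - P_{i,s+2}(t,\Delta t) - P_{i,i+1}(t,\Delta t) \\
&= 1 - (r(\Delta t))^{k} - k\,(r(\Delta t))^{k-1}(1 - r(\Delta t))\,r_d'\,r_c' \cdot \bigl[1 - r_s'\bigl(1 - (1 - r')^{s-i+1}\bigr)\bigr] \\
&\qquad - k\,(r(\Delta t))^{k-1}(1 - r(\Delta t))\,r_d'\,r_s'\,\bigl(1 - (1 - r')^{s-i+1}\bigr) \\
&= 1 - (r(\Delta t))^{k} - k\,(r(\Delta t))^{k-1}(1 - r(\Delta t))\,r_d' \\
&\qquad \cdot \bigl[r_c'\bigl(1 - r_s'\bigl(1 - (1 - r')^{s-i+1}\bigr)\bigr) + r_s'\bigl(1 - (1 - r')^{s-i+1}\bigr)\bigr] \\
&= 1 - (r(\Delta t))^{k} - k\,(r(\Delta t))^{k-1}(1 - r(\Delta t))\,r_d' \\
&\qquad \cdot \bigl[r_c' + (1 - r_c')\,r_s'\bigl(1 - (1 - r')^{s-i+1}\bigr)\bigr]
\end{aligned}$$

for $1 \le i \le s$.

For states s + 1 and s + 2,

$$P_{s+1,FAIL}(t,\Delta t) = 1 - (r(\Delta t))^{k} - k\,(r(\Delta t))^{k-1}(1 - r(\Delta t))\,r_d'\,r_c',$$

and

$$P_{s+2,FAIL}(t,\Delta t) = 1 - r_d(\Delta t)\,r_c(\Delta t)\,(r(\Delta t))^{k-1}.$$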


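The state probability equation for state 1 may be obtained by
use of equation (3-1), the general probability equation. The resultant
equation is

$$\begin{aligned}
P_1(t + \Delta t) &= (1 - P_{1,2}(t,\Delta t) - P_{1,s+2}(t,\Delta t) - P_{1,FAIL}(t,\Delta t))\,P_1(t) \\
&= P_{1,1}(t,\Delta t)\,P_1(t) \\
&= (r(\Delta t))^{k}\,P_1(t).
\end{aligned} \tag{3-3}$$

The state probability equation for states 2 through s may be obtained as

$$\begin{aligned}
P_i(t + \Delta t) &= P_{i-1,i}(t,\Delta t)\,P_{i-1}(t) + (1 - P_{i,i+1}(t,\Delta t) - P_{i,s+2}(t,\Delta t) - P_{i,FAIL}(t,\Delta t))\,P_i(t) \\
&= P_{i-1,i}(t,\Delta t)\,P_{i-1}(t) + P_{i,i}(t,\Delta t)\,P_i(t) \\
&= k\,(r(\Delta t))^{k-1}(1 - r(\Delta t))\,r_d'\,r_s'\,\bigl(1 - (1 - r')^{s-i+2}\bigr)\,P_{i-1}(t) + (r(\Delta t))^{k}\,P_i(t)
\end{aligned}$$

for $2 \le i \le s$.

For state s+1, the state probability equation is

$$\begin{aligned}
P_{s+1}(t + \Delta t) &= P_{s,s+1}(t,\Delta t)\,P_s(t) + P_{s+1,s+1}(t,\Delta t)\,P_{s+1}(t) \\
&= k\,\bigl[(r(\Delta t))^{k-1}(1 - r(\Delta t))\,r_d'\,r_s'\,(1 - (1 - r'))\bigr]\,P_s(t) + (r(\Delta t))^{k}\,P_{s+1}(t).
\end{aligned}$$

So for all states i where $2 \le i \le s+1$,

$$P_i(t + \Delta t) = k\,\bigl[(r(\Delta t))^{k-1}(1 - r(\Delta t))\,r_d'\,r_s'\,\bigl(1 - (1 - r')^{s-i+2}\bigr)\bigr] \cdot P_{i-1}(t) + (r(\Delta t))^{k}\,P_i(t), \tag{3-4}$$

for $2 \le i \le s+1$.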

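The state probability equation for state s+2 is

\[ P_{s+2}(t + \Delta t) = \sum_{j=1}^{s+1} P_{j,s+2}(t,\Delta t)\, P_j(t) + P_{s+2,s+2}(t,\Delta t)\, P_{s+2}(t) \]

\[ = \sum_{j=1}^{s+1} k (r(\Delta t))^{k-1} (1 - r(\Delta t))\, r_d' r_c' \left[ 1 - r_s' \left( 1 - (1 - r')^{s-j+1} \right) \right] P_j(t) + r_d(\Delta t)\, r_c(\Delta t)\, (r(\Delta t))^{k-1} P_{s+2}(t). \]

Reduction of this equation yields

\[ P_{s+2}(t + \Delta t) = k (r(\Delta t))^{k-1} (1 - r(\Delta t))\, r_d' r_c' \sum_{j=1}^{s+1} \left[ 1 - r_s' \left( 1 - (1 - r')^{s-j+1} \right) \right] P_j(t) + r_d(\Delta t)\, r_c(\Delta t)\, (r(\Delta t))^{k-1} P_{s+2}(t). \tag{3-5} \]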

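The state probability equation for state FAIL is

\[ P_{\mathrm{FAIL}}(t + \Delta t) = \sum_{j=1}^{s+2} P_{j,\mathrm{FAIL}}(t,\Delta t)\, P_j(t) + P_{\mathrm{FAIL}}(t) \]

\[ = \sum_{j=1}^{s} \left[ 1 - (r(\Delta t))^{k} - k (r(\Delta t))^{k-1} (1 - r(\Delta t))\, r_d' \left[ r_c' + (1 - r_c')\, r_s' \left( 1 - (1 - r')^{s-j+1} \right) \right] \right] P_j(t) \]
\[ \quad + \left( 1 - (r(\Delta t))^{k} - k (r(\Delta t))^{k-1} (1 - r(\Delta t))\, r_d' r_c' \right) P_{s+1}(t) + \left[ 1 - r_d(\Delta t)\, r_c(\Delta t)\, (r(\Delta t))^{k-1} \right] P_{s+2}(t) + P_{\mathrm{FAIL}}(t) \]

\[ = \sum_{j=1}^{s+1} P_j(t) + P_{\mathrm{FAIL}}(t) + P_{s+2}(t) - \sum_{j=1}^{s+1} \left[ (r(\Delta t))^{k} + k (r(\Delta t))^{k-1} (1 - r(\Delta t))\, r_d' \left[ r_c' + (1 - r_c')\, r_s' \left( 1 - (1 - r')^{s-j+1} \right) \right] \right] P_j(t) \]
\[ \quad - \left[ r_d(\Delta t)\, r_c(\Delta t)\, (r(\Delta t))^{k-1} \right] P_{s+2}(t) \]

\[ = 1 - \sum_{j=1}^{s+1} \left[ (r(\Delta t))^{k} + k (r(\Delta t))^{k-1} (1 - r(\Delta t))\, r_d' \left[ r_c' + (1 - r_c')\, r_s' \left( 1 - (1 - r')^{s-j+1} \right) \right] \right] P_j(t) - \left[ r_d(\Delta t)\, r_c(\Delta t)\, (r(\Delta t))^{k-1} \right] P_{s+2}(t). \tag{3-6} \]

Here the summation index is written as j to keep it distinct from k, the number of on-line bit planes; for j = s+1 the spare term vanishes because no spares remain.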

The system reliability may now be computed as the summation of
probabilities of being in any state other than the failed state. Then:

\[ R(t) = \sum_{i=1}^{s+2} P_i(t) = 1 - P_{\mathrm{FAIL}}(t), \tag{3-7} \]

where the $P_i$'s are obtained from equations (3-3) through (3-6).
It is only necessary, then, to compute the probability of the occurrence
of state FAIL at any time t to determine the system reliability at that
time.
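The recurrences (3-3) through (3-7) can be stepped numerically. The following is a minimal sketch, not the author's program: the function name, argument names, and all numeric values are illustrative assumptions, and the primed quantities $r_d'$, $r_c'$, $r_s'$ and the spare reliability $r'$ are simply passed in as numbers (`rd_p`, `rc_p`, `rs_p`, `r_sp`) because their defining expressions lie outside this section.

```python
import math


def basic_system_step(P, k, s, r, rd, rc, rd_p, rc_p, rs_p, r_sp):
    """Advance the basic-system state probabilities by one time step.

    P[i-1] holds P_i(t) for states 1..s+2; together with P_FAIL(t) the
    entries are assumed to sum to 1.  Returns (new list, P_FAIL(t + dt)).
    """
    one_fails = k * r ** (k - 1) * (1.0 - r)  # exactly one on-line plane fails

    def spare_ok(i):
        # 1 - (1 - r')^(s-i+1): at least one of the s-i+1 spares is good
        return 1.0 - (1.0 - r_sp) ** (s - i + 1)

    new = [0.0] * (s + 2)
    new[0] = r ** k * P[0]                                      # (3-3)
    for i in range(2, s + 2):                                   # (3-4)
        new[i - 1] = (one_fails * rd_p * rs_p * spare_ok(i - 1) * P[i - 2]
                      + r ** k * P[i - 1])
    into_last = sum((1.0 - rs_p * spare_ok(j)) * P[j - 1]
                    for j in range(1, s + 2))
    new[s + 1] = (one_fails * rd_p * rc_p * into_last
                  + rd * rc * r ** (k - 1) * P[s + 1])          # (3-5)
    kept = sum((r ** k + one_fails * rd_p
                * (rc_p + (1.0 - rc_p) * rs_p * spare_ok(j))) * P[j - 1]
               for j in range(1, s + 2))
    p_fail = 1.0 - kept - rd * rc * r ** (k - 1) * P[s + 1]     # (3-6)
    return new, p_fail


if __name__ == "__main__":
    k, s, dt = 22, 4, 100.0                # dt in hours; illustrative values
    r = math.exp(-2.6e-6 * dt)             # hypothetical exponential rates
    rd = math.exp(-0.9e-6 * dt)
    rc = math.exp(-0.03e-6 * dt)
    P = [1.0] + [0.0] * (s + 1)            # system starts in state 1
    for _ in range(10):
        P, p_fail = basic_system_step(P, k, s, r, rd, rc,
                                      0.999, 0.999, 0.999, 0.99)
    print("R(t) =", sum(P), " P_FAIL(t) =", p_fail)             # (3-7)
```

Because each $P_{i,\mathrm{FAIL}}$ is defined as the complement of the other transitions out of state i, the returned state probabilities and `p_fail` sum to 1 at every step, which gives a convenient consistency check on the transcribed equations.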


Coverage Equations 

System coverage (C) is defined to be the probability that the system
will recover given that a failure has occurred [21]. This probability
is useful in reliability calculations and provides an indication of the
effectiveness of a fault-tolerant system. Hence, a derivation of
coverage equations for the basic system will now be shown.

If the system's states are examined, it is evident that a failure
in the time period from t to t + Δt may be grouped into one of three classes
depending upon the failure's effect on the system state at time t + Δt.
These classes are

1. The failure causes no change in system state,

2. The failure causes a transition to another system state
which is not the failed state, and

3. The failure causes a transition to the failed state.

The occurrences of class 1 and class 2 failures contribute to system coverage
while the occurrence of a class 3 failure does not. Denoting the
probability of the occurrence of class L failures, given that a
failure has occurred in the time period from t to t + Δt, by P(L), then

C(t) = P(1) + P(2).

But P(1) + P(2) + P(3) = 1,
so C(t) = 1 - P(3)

= 1 - P(class 3 failure/a failure has occurred in t to t + Δt).
In general, however, the subevents which constitute the class 3
failure event are dependent on the current system configuration (or state).
To overcome this difficulty, the state coverage, $C_i(t)$ (system coverage
given that the system state at time t is i), is introduced, where

$C_i(t)$ = 1 - P(state i, class 3 failure/a failure has occurred
in t to t + Δt where the system is in state i at
time t),

and a state i, class 3 failure is a component failure which causes a
transition from state i to the failed state.

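Now, by Bayes' Theorem,

\[ P(A_i/B) = \frac{P(B/A_i)\, P(A_i)}{P(B/A_1)\, P(A_1) + \cdots + P(B/A_n)\, P(A_n)} \]

where

\[ P(A_j, A_r) = 0 \quad \text{for } 1 \le j, r \le n,\; j \ne r, \]

and

\[ P(A_1 \text{ or } A_2 \text{ or } \ldots \text{ or } A_n) = 1. \]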


The following events are considered:

$A_1$: No failure has occurred in t to t + Δt;

$A_2$: Occurrence of a state i, class 1 failure in t to t + Δt;

$A_3$: Occurrence of a state i, class 2 failure in t to t + Δt;

$A_4$: Occurrence of a state i, class 3 failure in t to t + Δt;

B: Occurrence of a failure in t to t + Δt where the system
state at time t is i.
Then

\[ C_i(t) = 1 - P(A_4/B) = 1 - \frac{P(B/A_4)\, P(A_4)}{P(B/A_1)\, P(A_1) + P(B/A_2)\, P(A_2) + P(B/A_3)\, P(A_3) + P(B/A_4)\, P(A_4)}. \]

But $P(B/A_2) = P(B/A_3) = P(B/A_4) = 1$, and $P(B/A_1) = 0$, so

\[ C_i(t) = 1 - \frac{P(A_4)}{P(A_2) + P(A_3) + P(A_4)} \]

\[ = 1 - \frac{P(\text{occurrence of a state } i \text{, class 3 failure in } t \text{ to } t + \Delta t)}{P(\text{occurrence of a failure in } t \text{ to } t + \Delta t / \text{state at } t \text{ is } i)}. \]

Since each occurrence of a state i, class 3 failure results in a
transition from state i to the failed state and no other conditions
cause this transition, it follows that

P(occurrence of a state i, class 3 failure in t to t + Δt)

= P(transition from state i to the failed state in t to t + Δt).

To compute the probability of a failure in the time period from
t to t + Δt, a hypothetical series system S, which contains all system
components for state i, may be constructed.

If the reliability, $R_S(t)$, of this system is computed, then the
failure density function of the system may be obtained as

\[ f_S(t) = -\frac{d R_S(t)}{dt}. \]

The probability of system S failure in the time period from t to t + Δt
is

\[ P(\text{failure of } S \text{ in } t \text{ to } t + \Delta t) = \int_{t}^{t + \Delta t} f_S(\tau)\, d\tau = R_S(t) - R_S(t + \Delta t), \]

as was shown in a previous section.

The reliability of a series system is the product of all system
component reliabilities, so

\[ P(\text{failure of } S \text{ in } t \text{ to } t + \Delta t) = \prod_{j=1}^{n_i} r_j(t) - \prod_{j=1}^{n_i} r_j(t + \Delta t), \]

where $n_i$ is the number of components in S and $r_j(t)$ is the reliability
of the j'th system component at time t.
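The difference of products above is easy to evaluate directly. The sketch below assumes exponentially distributed component lifetimes, so that $r_j(t) = e^{-\lambda_j t}$; the function name and the failure rates are hypothetical and chosen only for illustration.

```python
import math


def series_failure_probability(rates, t, dt):
    """P(failure of S in t..t+dt) = prod r_j(t) - prod r_j(t+dt).

    rates: per-hour failure rates of the components of S (assumed
    exponential, so r_j(t) = exp(-rate_j * t)).
    """
    r_now = math.prod(math.exp(-lam * t) for lam in rates)
    r_later = math.prod(math.exp(-lam * (t + dt)) for lam in rates)
    return r_now - r_later


if __name__ == "__main__":
    # three hypothetical components, interval of 10 hours starting at t = 0
    print(series_failure_probability([1e-3, 2e-3, 5e-4], 0.0, 10.0))
```

For exponential components the result reduces to $e^{-\Lambda t} - e^{-\Lambda (t + \Delta t)}$ with $\Lambda$ the sum of the rates, which is a quick way to check the function.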

Since the failure event for a series system occurs when any system
component or combination of components fails and since S contains all
components of interest for state i of the original system, then

P(failure of S in t to t + Δt)

= P(occurrence of a failure in t to t + Δt/state at t is i)

\[ = \prod_{j=1}^{n_i} r_j(t) - \prod_{j=1}^{n_i} r_j(t + \Delta t), \]

where $n_i$ is the number of components in state i of the original system.
As was shown previously,

\[ r_j(t) = 1 \quad \text{and} \quad r_j(t + \Delta t) = r_j(\Delta t) \]

for a system component j which is required for operation in state i at
time t. If the number of these components is $m_i$, then

P(occurrence of a failure in t to t + Δt/state at t is i)

\[ = \prod_{j=1}^{n_i - m_i} r_j(t) - \prod_{q=1}^{m_i} r_q(\Delta t) \prod_{k=1}^{n_i - m_i} r_k(t + \Delta t). \]


For state i ($1 \le i \le s+1$) of the basic system, this probability is


d 'c 's 


where all symbols have been previously defined.
Then 


C.(t) = 1 - 
1 




r ,r r r 
d c s 


(s-i+l)_(^(At))k^^^r^^^^.(^.)(s-i+TT 


(3-8) 


for $1 \le i \le s+1$.

For state s+2, $C_{s+2}(t)$ is obtained as

^s-»-2,FAIL^^^ 


1 - 


r^ - (r(At))^*^r^(At)r^(At)rg- 


(3-9) 


Recalling that

$C_i(t)$ = P(system will recover/a failure occurs in t to t + Δt
where the system is in state i at time t),

then

P[(system will recover/a failure occurs in t to t + Δt) and
the system is non-failed at time t]

\[ = \sum_{i=1}^{s+2} C_i(t)\, P_i(t). \]

Since, for a non-repairable system, it is meaningless to compute 
coverage for the system after it has failed, the total system coverage 
may be considered to be 
C(t) = P[(system will recover/a failure occurs in t to t + Δt)/
the system is non-failed at time t].

This is of the form P[A/B], whereas the previously derived equation
is of the form P[A and B]. Since

\[ P[A/B] = \frac{P[A \text{ and } B]}{P[B]}, \]

then

\[ C(t) = \text{total system coverage} = \frac{\sum_{i=1}^{s+2} C_i(t)\, P_i(t)}{\sum_{i=1}^{s+2} P_i(t)} = \frac{\sum_{i=1}^{s+2} C_i(t)\, P_i(t)}{R(t)}, \tag{3-10} \]

where the $C_i$'s are obtained from equations (3-8) and (3-9), the $P_i$'s
from equations (3-3) through (3-6), and R from equation (3-7).
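Equation (3-10) is a probability-weighted average of the state coverages over the non-failed states. A minimal sketch follows; the function name and the sample numbers are illustrative assumptions, and the state coverages and state probabilities are taken as already computed.

```python
def system_coverage(state_coverage, state_prob):
    """Total coverage C(t) per (3-10).

    state_coverage[i] is C_i(t) and state_prob[i] is P_i(t) for the
    non-failed states; the denominator is R(t), the sum of the P_i(t).
    """
    reliability = sum(state_prob)
    if reliability <= 0.0:
        raise ValueError("coverage is undefined once the system has failed")
    weighted = sum(c * p for c, p in zip(state_coverage, state_prob))
    return weighted / reliability


if __name__ == "__main__":
    # hypothetical three-state example
    print(system_coverage([0.99, 0.95, 0.40], [0.90, 0.06, 0.01]))
```

Dividing by R(t) rather than by 1 is what conditions the result on the system being non-failed at time t.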


Computer Evaluation 

Three approaches to computer evaluation of equations of the type
presented will be described in this section. These methods are:

1) Manual substitution of transition probability equations
into the general state probability equation and evaluation
of the state probability equations each Δt,

2) Evaluation of the transition probability equations and
substitution of the results into the general state
probability equation each Δt, and

3) Evaluation of a product of a T-type matrix and a T
matrix which is updated each Δt.

Methods 1 and 2 are straightforward. Method 3 will now be discussed.

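It was shown in a preceding section that

\[ P(t + \Delta t) = T \times P(t) \tag{3-2} \]

where P(t + Δt) and P(t) are state probability vectors and T contains
$P_{i,j}(t,\Delta t)$ in its i,j location.

Then

\[ P(t + 2\Delta t) = T_1 \times P(t + \Delta t) \]

where $T_1$ is T evaluated at time t + Δt. By substitution,

\[ P(t + 2\Delta t) = T_1 \times [T \times P(t)] = [T_1 \times T] \times P(t). \]

In general,

\[ P(t + n\Delta t) = [T_{n-1} \times T_{n-2} \times \cdots \times T_1 \times T]\, P(t) \]

\[ = [T \times T_1 \times \cdots \times T_{n-2} \times T_{n-1}]\, P(t) \]

\[ = T_n^{*}\, P(t) \]

\[ \text{where } T_n^{*} = [T \times T_1 \times \cdots \times T_{n-2} \times T_{n-1}]. \tag{3-11} \]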

Thus, to evaluate P(t + nΔt) when P(t) is known, the following algorithm
may be used.

1. Evaluate T at time t, set $T_1^{*} = T$, i = 1.

2. Evaluate T at time t + iΔt to obtain $T_i$.

3. $T_{i+1}^{*} = T_i^{*} \times T_i$.

4. If i < n then i = i + 1, go to 2. Otherwise, P(t + nΔt) =
$T_n^{*} \times P(t)$, stop.
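The accumulation in steps 1 through 4 can be sketched in a few lines. This is an illustration under stated assumptions, not the listing referred to in the text: `make_T` is a hypothetical callback returning the T matrix for a given time, entry (i, j) of T is taken to be the probability of moving from state i to state j (matching the statement that T holds $P_{i,j}$ in its i,j location), and the state probabilities are therefore applied as a row vector on the left of the accumulated product.

```python
def mat_mul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][x] * b[x][j] for x in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def row_times(v, m):
    """Row vector v times matrix m."""
    return [sum(v[i] * m[i][j] for i in range(len(v)))
            for j in range(len(m[0]))]


def accumulated_T(make_T, t, dt, n):
    """Return T_n* = T x T_1 x ... x T_(n-1), as in (3-11)."""
    t_star = make_T(t)                                  # step 1: T_1* = T
    for i in range(1, n):                               # steps 2 and 3
        t_star = mat_mul(t_star, make_T(t + i * dt))    # T_(i+1)* = T_i* x T_i
    return t_star


if __name__ == "__main__":
    def make_T(time):
        a = 0.01 + 0.001 * time        # hypothetical time-varying failure prob.
        return [[1.0 - a, a], [0.0, 1.0]]

    P0 = [1.0, 0.0]
    print(row_times(P0, accumulated_T(make_T, 0.0, 1.0, 5)))   # P(t + 5 dt)
```

The loop here stops once $T_n^{*}$ is available, which avoids the one extra matrix evaluation that a literal reading of step 4 would perform.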

For a system with a small number of states and state transitions,
method 1 is manageable. For systems with a large number of states, however,
either method 2 or 3 is more expedient. Example flowcharts for
methods 2 and 3 are presented in Appendix A. Program listings may be
found in [27].

The selection of a suitable Δt for use in the computer evaluation
of these equations is a difficult task. This problem will now be
discussed.

The time period At was originally defined to be a time period in 
which no more than one state transition is likely to occur. Since 
the probability of more than one state transition occurring may be 
represented as a product of state transition probabilities, the 
monitoring of these products during execution will give an indication 
of the appropriateness of the selected At. 

By specifying a maximum allowable probability, $p_{max}$, for the
occurrence of two state transitions in the Δt, and reducing Δt when
this probability is exceeded, the computational error may be reduced.
The following algorithm will implement this self-monitoring control
for a method 3-type evaluation.

1. Evaluate T at time t, set $T_1^{*} = T$, i = 1.

1a. Specify initial Δt.

2. Evaluate T at time t + iΔt to obtain $T_i$.

2a. For each non-diagonal entry $T_i(j,k)$ compute $T_i(j,k) \cdot T_i(k,m)$
for each m.

2b. If any of these products is greater than $p_{max}$, reduce Δt
and go to 2.

3. $T_{i+1}^{*} = T_i^{*} \times T_i$.

4. If i < n then i = i + 1, go to 2. Otherwise P(t + nΔt) =
$T_n^{*} \times P(t)$, stop.
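Steps 2a and 2b can be illustrated as follows. This sketch makes several assumptions that are not in the text: Δt is reduced by halving, `make_T(t, dt)` is a hypothetical callback that rebuilds T for a trial Δt, a cap on the number of reductions guards against a non-terminating loop, and the case m = k is skipped on the reading that a diagonal entry represents a non-transition rather than a second transition.

```python
def max_double_transition(T):
    """Largest product T(j,k) * T(k,m) over two successive transitions."""
    n = len(T)
    worst = 0.0
    for j in range(n):
        for k in range(n):
            if k == j:
                continue                 # T(j,k) must be non-diagonal
            for m in range(n):
                if m == k:
                    continue             # assumed: staying in k is not a transition
                worst = max(worst, T[j][k] * T[k][m])
    return worst


def choose_dt(make_T, t, dt, p_max, shrink=0.5, max_reductions=60):
    """Reduce dt until no two-transition product exceeds p_max."""
    for _ in range(max_reductions):
        if max_double_transition(make_T(t, dt)) <= p_max:
            return dt
        dt *= shrink                     # assumed reduction rule: halve dt
    raise RuntimeError("dt could not be reduced far enough")


if __name__ == "__main__":
    import math

    def make_T(time, step):
        a = 1.0 - math.exp(-0.01 * step)     # hypothetical per-step failure prob.
        return [[1.0 - a, a, 0.0], [0.0, 1.0 - a, a], [0.0, 0.0, 1.0]]

    print(choose_dt(make_T, 0.0, 10.0, 1e-4))
```

With the hypothetical matrix above the only non-zero two-transition product is a², so the routine keeps halving Δt until a² falls below the chosen bound.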


In general, the value selected for $p_{max}$ is dependent on the
subsystem failure rates and the computational accuracy of the computing
system used. For the computations of this paper, satisfactory
results were obtained by the use of $p_{max}$ in the range from .0001 to
.000001.

The magnitude of the computational error accumulated at time t may
be approximated by determining the magnitude of the difference of the
sum of all state probabilities and 1. In equational form,

\[ |e(t)| = \left| 1 - \sum_{i=1}^{N} P_i(t) \right| \]

where N is the number of system states.

The percent error in system reliability may be approximated by

\[ e(t)\% = \frac{|e(t)|}{R(t)} \times 100\%. \]


IV. RELIABILITY EQUATIONS FOR ALTERNATE SYSTEMS 


This chapter will show equational developments for the reliability
of the non-spared, TMR, duplicated and double-error-correcting systems.
A method will also be shown which allows the computation of the
probability of various memory word fault patterns and the effects of
these patterns on system reliability.


Non-Spared System

The non-spared system is capable of operation in only 3 states.
These states correspond to states 1, s+2, and FAIL in the basic system.
By substitution of 0 for s in the equations for the basic system, the
state probability equations for states 1, 2, and FAIL of the non-spared
system are obtained as follows:


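\[ P_{NS,1}(t + \Delta t) = (r(\Delta t))^{k} P_{NS,1}(t), \tag{4-1} \]

\[ P_{NS,2}(t + \Delta t) = k (r(\Delta t))^{k-1} (1 - r(\Delta t))\, r_d' r_c' P_{NS,1}(t) + (r_d(\Delta t)) (r_c(\Delta t)) (r(\Delta t))^{k-1} P_{NS,2}(t), \tag{4-2} \]

and

\[ P_{NS,\mathrm{FAIL}}(t + \Delta t) = 1 - \left[ (r(\Delta t))^{k} + k (r(\Delta t))^{k-1} (1 - r(\Delta t))\, r_d' r_c' \right] P_{NS,1}(t) - \left[ r_d(\Delta t)\, r_c(\Delta t)\, (r(\Delta t))^{k-1} \right] P_{NS,2}(t). \tag{4-3} \]

By analogy with equation (3-7), the reliability of the non-spared system is then

\[ R_{NS}(t) = P_{NS,1}(t) + P_{NS,2}(t) = 1 - P_{NS,\mathrm{FAIL}}(t), \tag{4-4} \]

where the $P_{NS}$'s are obtained from equations (4-1) through (4-3).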

TMR System 

The reliability of the TMR system may be approximated from the
reliability of the non-spared system by application of the classical
TMR equation. From [24], this equation is

\[ R_{TMR}(t) = \left[ 3 (R(t))^{2} - 2 (R(t))^{3} \right] r_{VT}(t), \]

where R(t) is the unreplicated unit reliability, and $r_{VT}(t)$ is the
reliability of the voting and codeword testing circuitry.

Then

\[ R_{TMR}(t) = \left[ 3 (R_{NS}(t))^{2} - 2 (R_{NS}(t))^{3} \right] r_{VT}(t), \tag{4-5} \]

where $R_{NS}$ is obtained by use of equation (4-4).

Duplicated System

The reliability of the duplicated system may be computed by
determining the probability of the various operational modes of the
system. These modes are:

1. Both non-spared units operate correctly,

2. the unit currently on-line fails, and the sense switching
circuitry switches the system output to the other unit
which is non-failed, and

3. the unit currently off-line fails.

The reliability of this system, then, is

\[ R_D(t) = (R_{NS}(t))^{2} + [1 - R_{NS}(t)]\, R_{NS}(t)\, r_{ss}(t) + R_{NS}(t)\, [1 - R_{NS}(t)] \]

\[ = R_{NS}(t) + [1 - R_{NS}(t)]\, R_{NS}(t)\, r_{ss}(t), \tag{4-6} \]

where $r_{ss}(t)$ is the reliability of the sense-switching circuitry and
$R_{NS}$ is obtained by use of equation (4-4).
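Equations (4-5) and (4-6) are simple enough to compare numerically for a given non-spared reliability. The sketch below is only an illustration: the function names are hypothetical, and the sample value of $R_{NS}$ and the unit reliabilities of the voter and switch are arbitrary.

```python
def tmr_reliability(r_ns, r_vt):
    """Classical TMR equation (4-5): [3 R^2 - 2 R^3] * r_VT."""
    return (3.0 * r_ns ** 2 - 2.0 * r_ns ** 3) * r_vt


def duplicated_reliability(r_ns, r_ss):
    """Duplicated-system equation (4-6): R + (1 - R) * R * r_ss."""
    return r_ns + (1.0 - r_ns) * r_ns * r_ss


if __name__ == "__main__":
    r_ns = 0.9                       # hypothetical non-spared reliability
    print("TMR:       ", tmr_reliability(r_ns, 1.0))
    print("duplicated:", duplicated_reliability(r_ns, 1.0))
```

With ideal voting and switching circuitry and $R_{NS} = 0.9$, the two expressions give 0.972 and 0.99 respectively; both fall below $R_{NS}$-based expectations quickly once $R_{NS}$ drops toward 0.5, which is the behavior discussed in the analysis chapter.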

Double-Error-Correcting (DEC) System

Carter and McCarthy [20] have described a fault-tolerant memory
system of the double-error-correcting type which utilizes a software
implementable double-error-correction algorithm. The algorithm is based
on a concept of memory word error modeling which will now be described.

The non-operational modes of a memory word bit cell are assumed
to be:

1) Stuck-at-one (s-a-1), and

2) Stuck-at-zero (s-a-0).

The occurrence of either of these modes is termed a fault.

The class of all faults may be partitioned into two subclasses
by the effect of each fault on the correct memory word bit. If the
fault is of the s-a-x type and the correct memory word bit for that
location is x, then no effect on the memory bit occurs. Faults of this
subclass are termed failures. If the fault is s-a-x and the correct
bit is $\bar{x}$ (the complement of x), then the fault causes an incorrect
response on a memory read operation. Faults of this type are called errors.

The weight of a binary word is defined to be the number of
binary digits (bits) in the word which are logic 1. By analysis of
the words of a particular code the sum of all codeword weights, W, may
be obtained. An average codeword weight, $\bar{w}$, is computed by

\[ \bar{w} = \frac{W}{V}, \]

where V is the total number of codewords. If $\bar{w}$ is divided by N, the
length in bits of each codeword, an approximation to the statistical
probability of any given bit of a word being a logic 1 is obtained.
In equational form,

\[ P(\text{word bit} = 1) = P_{w1} = \frac{\bar{w}}{N}, \quad \text{and} \]

\[ P(\text{word bit} = 0) = P_{w0} = 1 - P_{w1} = 1 - \frac{\bar{w}}{N}. \]

A statistical analysis of faults for a memory system should isolate
the following probabilities for the bit locations of a data word:

P(bit location s-a-1/location faulted) = $P_{s1}$, and

P(bit location s-a-0/location faulted) = $P_{s0}$.

It is now possible to obtain the probability of a failure when it
is known that a single word fault has occurred. This probability is

P(failure/1 fault) = P[(bit location s-a-1/location faulted)
and word bit = 1] + P[(bit location
s-a-0/location faulted) and word bit = 0]

\[ = P_{s1} P_{w1} + P_{s0} P_{w0}. \]


In a similar manner,

\[ P(\text{error}/1 \text{ fault}) = P_{s1} P_{w0} + P_{s0} P_{w1}. \]

Since P(failure/1 fault) + P(error/1 fault)

\[ = P_{s1} P_{w1} + P_{s0} P_{w0} + P_{s0} P_{w1} + P_{s1} P_{w0} = (P_{s1} + P_{s0})(P_{w1} + P_{w0}) = 1 \cdot 1 = 1, \]

the binomial probability distribution may be used to compute the probability
of any combination of errors and failures in a word given that a certain
number of faults has occurred.

Then

P(n failures and m errors/n + m faults)

\[ = \binom{n+m}{n} (P_{s1} P_{w1} + P_{s0} P_{w0})^{n} (P_{s1} P_{w0} + P_{s0} P_{w1})^{m}. \]

If the binomial distribution is also used to compute the probability
of n+m faults, then

P(n failures and m errors, n + m faults in b bits)

\[ = \binom{n+m}{n} (P_{s1} P_{w1} + P_{s0} P_{w0})^{n} (P_{s1} P_{w0} + P_{s0} P_{w1})^{m} \binom{b}{n+m} r^{\,b-(n+m)} (1 - r)^{n+m}, \]

where r is the reliability of a memory word bit location.
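The joint probability above can be evaluated directly with binomial coefficients. The sketch below is a hedged illustration: the function name is hypothetical, it takes $P_{s0} = 1 - P_{s1}$ and $P_{w0} = 1 - P_{w1}$ as in the text, and the numeric arguments in the usage lines are arbitrary rather than measured values.

```python
from math import comb


def p_failures_and_errors(n, m, b, r, p_s1, p_w1):
    """P(n failures and m errors, n + m faults in b bits).

    r    : reliability of one memory word bit location
    p_s1 : P(bit location s-a-1 / location faulted); P_s0 = 1 - p_s1
    p_w1 : P(word bit = 1);                          P_w0 = 1 - p_w1
    """
    p_s0, p_w0 = 1.0 - p_s1, 1.0 - p_w1
    p_failure = p_s1 * p_w1 + p_s0 * p_w0    # fault agrees with the stored bit
    p_error = p_s1 * p_w0 + p_s0 * p_w1      # fault contradicts the stored bit
    f = n + m
    return (comb(f, n) * p_failure ** n * p_error ** m
            * comb(b, f) * r ** (b - f) * (1.0 - r) ** f)


if __name__ == "__main__":
    # hypothetical 22-bit word: probability of 1 failure and 1 error
    print(p_failures_and_errors(1, 1, 22, 0.999, 0.5, 0.5))
```

Summing the function over all n and m with n + m no greater than b returns 1, which follows from the two binomial expansions and serves as a check on the transcription.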


Since $\binom{b}{n+m}$ is the number of (n+m)-fault words which may occur and
$\binom{n+m}{n}$ is the number of ways that exactly n failures may be ordered among
n+m faults, then the number of distinct (n+m)-fault words with n failures
is $\binom{n+m}{n} \binom{b}{n+m}$. The total number of distinct (with regard to number and
order of failures) (n+m)-fault words is then

\[ \sum_{i=0}^{n+m} \binom{n+m}{i} \binom{b}{n+m}. \]

These numbers may now be used to obtain the percentage of f-fault
words which contain a given number of failures. For example, the
percentage of f-fault words of b bits which contain f failures is

\[ \frac{\binom{f}{f} \binom{b}{f}}{\sum_{i=0}^{f} \binom{f}{i} \binom{b}{f}} \times 100\% = \frac{1}{\sum_{i=0}^{f} \binom{f}{i}} \times 100\%, \]

a useful figure, since an f-fault word with f failures is error-free.
The application of these concepts to the double-error-correcting
system will be shown following a discussion of correctable error types
for the system.
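As a small worked check of the percentage just given, the ratio can be computed from the counts themselves. The function name below is a hypothetical choice; the only inputs are the number of faults f and the word length b.

```python
from math import comb


def fraction_error_free(f, b):
    """Fraction of distinct f-fault words of b bits whose faults are all failures."""
    all_failures = comb(f, f) * comb(b, f)
    total = sum(comb(f, i) * comb(b, f) for i in range(f + 1))
    return all_failures / total


if __name__ == "__main__":
    for f in range(5):
        print(f, fraction_error_free(f, 22))
```

The word length cancels, so the fraction equals 1 divided by the sum of the binomial coefficients of f, that is, 2 to the power -f; for f = 4 this is .0625.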

A fault pattern vector for a memory word is defined as

FPV = (he jf, qe nf)

where h and q are the numbers of errors in the memory word data and
check bits, respectively, and j and n are the numbers of failures.

The double-error-correction algorithm discussed will always
produce a valid correction when presented with memory words with FPV's
of certain forms. These forms, from [20], are as follows:

(2e 0f, 0e 0f); (1e 0f, 1e 0f); (0e 0f, 2e 0f);

(2e 1f, 0e 0f); (2e 0f, 0e 1f); (0e 0f, 2e 1f);

(2e 1f, 0e 1f); (2e 0f, 0e 2f); (0e 0f, 2e 2f);

(0e 0f, 4e 0f).
For memory words with FPV's of the following forms, correction may or
may not be attempted and results may be invalid [20]:

(1e 0f, 1e 1f); (0e 1f, 2e 0f);

(1e 0f, 1e 2f); (0e 1f, 2e 1f).

No error correction is attempted in the following cases [20]:

(1e 1f, 1e 0f);

(2e 2f, 0e 0f); (1e 2f, 1e 0f); (1e 1f, 1e 1f);

(4e 0f, 0e 0f); (3e 0f, 1e 0f); (2e 0f, 2e 0f).

It should be noted that the preceding FPV's listed all contain an
even number of errors and will produce error syndrome vectors of even
weight. The computation of a syndrome of this type by the memory
translator causes the invocation of this algorithm.

A second algorithm has been designed to attempt data reconstruction 
when an odd-weight error syndrome is computed. Since many triple-error 
patterns produce a single-error syndrome and a high percentage of these 
syndromes imply an error in a valid bit, a critical function of this 
algorithm is to distinguish between single and triple word errors. 

This algorithm is capable of reconstructing all memory words with
FPV's containing exactly one error and two or fewer failures. In
addition, all memory words with FPV's containing one error and three
failures are corrected with the exception of the FPV

(0e 3f, 1e 0f)

for which no reconstruction is attempted [20].

Valid results [20] are also produced for

(0e 0f, 3e 0f) and (0e 0f, 3e 1f).

Correction results are variable [20] for memory words with the
following FPV's:

(2e 0f, 1e 0f); (1e 0f, 2e 0f);

(2e 0f, 1e 1f); (1e 0f, 2e 1f).

No correction is attempted [20] for the case listed above and the
cases

(3e 0f, 0e 0f); (3e 0f, 0e 1f).

The listings above show that any combination of two or fewer
faults in a memory word will be algorithmically corrected. For words
with three faults, the percentage of words which are corrected may be
computed as follows.

The number of ways in which three faults may appear in a word with
k bits is

(The number of ways 3 failures can appear) +

(The number of ways 2 failures and 1 error can appear) +

(The number of ways 1 failure and 2 errors can appear) +

(The number of ways 3 errors can appear)

\[ = \left[ \binom{k}{3} + 3\binom{k}{3} + 3\binom{k}{3} + \binom{k}{3} \right] = 8\binom{k}{3}. \]

The first term of this sum represents all 3-fault words with no errors.
No correction is required for these words. In addition, the triple
error algorithm will correct all $3\binom{k}{3}$ three-fault words with only one
error.

A 3-fault word containing 2 errors will not be corrected if the FPV
is of the form

(1e 1f, 1e 0f).

If the number of data bits in the word is D and the number of check bits
is C, then the number of 3-fault patterns of this form is

\[ 2C\binom{D}{2}. \]

The number of 3-fault words with 2 errors for which correction is
uncertain is

\[ 2D\binom{C}{2} + D\binom{C}{2} = 3D\binom{C}{2}. \]

A 3-fault word with three errors will be corrected if the FPV is
of the form

(0e 0f, 3e 0f).

The number of patterns of this form is

\[ \binom{C}{3}. \]

The number of 3-fault words with three errors for which correction is
uncertain is

\[ \left[ \binom{D}{2}\binom{C}{1} + \binom{D}{1}\binom{C}{2} \right] = \left[ C\binom{D}{2} + D\binom{C}{2} \right]. \]


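The total number, $T_3$, of 3-fault words which are correctable is then
bounded as follows:

\[ \left[ \binom{k}{3} + 3\binom{k}{3} + 3\binom{k}{3} - 2C\binom{D}{2} + \binom{C}{3} \right] < T_3 < \left[ \binom{k}{3} + 3\binom{k}{3} + 3\binom{k}{3} - 2C\binom{D}{2} + \binom{C}{3} + 3D\binom{C}{2} + C\binom{D}{2} + D\binom{C}{2} \right], \]

that is,

\[ \left[ 7\binom{k}{3} - 2C\binom{D}{2} + \binom{C}{3} \right] < T_3 < \left[ 7\binom{k}{3} - C\binom{D}{2} + \binom{C}{3} + 4D\binom{C}{2} \right]. \]

Since there are $8\binom{k}{3}$ possible ways that 3 faults can occur, the
percentage, $u_3$, of 3-fault words that can be corrected is

\[ u_3 = \frac{T_3}{8\binom{k}{3}} \times 100\%. \]

For the (22, 16) code of the basic system, $u_3$ may be computed as:

75.96% < $u_3$ < 89.45%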

A breakdown of double-error-correcting system correction percentages
by the number of memory word faults is shown in Table 3. In this
table, $u_{m,n}$ refers to m-fault words with n errors which are system
correctable, and $u_m$ denotes the total percentage of m-fault FPV's which
are correctable.

The switching strategy assumed for the double-error-correcting
system is as follows:

1) If a memory word is detected to have a single error, the
single error correction procedure is performed.

2) If the word has two errors, one of the faulty on-line
bit planes is switched out and replaced with a spare.
Error correction is attempted by use of the double-error
correction procedure.

3) If the word has either three or four errors, a correction
is attempted. If the correction is successful, faulty
on-line bit planes are replaced with spare bit planes until
either all available spares are exhausted or only one
faulty bit plane remains on-line.


TABLE 3. Percentage of Memory Word FPV's Correctable for
the Double-Error-Correcting System (22, 16) Code

u_{F,e} = (% correctable/100%) x (% of F-fault words with e errors/100%)

F (# faults)   e (# errors)   u_{F,e}
------------   ------------   ------------------------------
0              0              u_{0,0} = [illegible]
0              0              u_0 = 1
1              0              u_{1,0} = [illegible]
1              1              u_{1,1} = [illegible]
1              0,1            u_1 = 1
2              0              u_{2,0} = [illegible]
2              1              u_{2,1} = [illegible]
2              2              u_{2,2} = .25
2              0,1,2          u_2 = 1
3              0              u_{3,0} = [illegible]
3              1              u_{3,1} = [illegible]
3              2              .258 ≤ u_{3,2} ≤ [illegible]
3              3              .0016 < u_{3,3} < .078
3              0,1,2,3        .7596 < u_3 < .8945
4              0              u_{4,0} = .0625
4              1              u_{4,1} = [illegible]
4              2              .102 ≤ u_{4,2} ≤ .119
4              3              .0005 ≤ u_{4,3} ≤ .0395
4              4              u_{4,4} = [illegible]
4              0,1,2,3,4      .4151 < u_4 < .4711

A T matrix may be constructed as shown in Figure 6 with the system
configuration in each state as shown in Table 4.

Appendix B shows the derivation of the state transition probability 
equations for this system. If the notational simplifications 

f = E(x,y), and r(At) = r* 

k=0 ^ 

are made and the reliability of the algorithmic correction procedure
is denoted by $r_A$, then the transition equations appear as follows:

= D(k,l) 

P^^3(t,At) = D(k,2) 

P-j^4(t,At) = D(k,3) rj"r^-*r^'rg"(l-E(2,l)). 

P"] ^g{t,At) = D(k,4) '^s 

+ D{k,3) rj'r^,-'r^"rg''(E{2,l)-E(2,C)) 

+ D(k,4) r^-rj,"r^"rg'-(E(2,2)-E(2,l}). 

Pl^s+4(t,At) = D(k,3) Vr^-r^'d-r^' t rg- E(2,0)) 

+ D(k,4) r^-rj,"r^"r3-(E(2,l)-E(2.0)). 

Pi^s+ 5^^’^^5 = DCk,4) ''s' ^^2,0)). 



FIGURE 6. T-Matrix for Double-Error-Correcting System































TABLE 4. State Configurations for Double-Error-Correcting System

State            Configuration
--------------   ------------------------------------------------------
1                k good bit planes on-line, s available spares
2 ≤ i ≤ s+2      k-1 good bit planes on-line, s - i + 2 available spares
s + 3            k-2 good bit planes on-line
s + 4            k-3 good bit planes on-line
s + 5            k-4 good bit planes on-line
FAIL             An uncorrectable word error exists

$P_{1,1}(t,\Delta t) = D(k,0)$.

P.^..^^(t,At) = D(k-l,l) O-E(i,0)) 

for 2 £ i £ s + 1 . 

= D(k-1,2) i^d**'c*'^A*'"s" 

for 2 £ i £ s. 

P^^^+3(t.At) = D(k-1,3) 0-E(i»2)) 

for 2 < 1 £ s - 1 . 

P^^5^.3(t,At) = D(k-l,1) ^ ’"s" 

+ D(k-1,2) {E{i.l)-E{1,0)) 

+ D(k-1.3) Vr^%*rg- (E(i ,2)-E(i.l)) 

for 2 £ 1 £ s. 

Pi,£+4(t,At) = DCk-1,2) r^%*r^* + r^' E(1,0)) 

+ D(k-1,3) Vr^*r^*rg' (E{i,l>E(i.O)) 

for 2 £ i £ s+1 . 

Pi,s+5(t,At) = D(k-1.3) r/r^*r^* (l-r^' ^ r^' E(1,0)) 

2 < i < s+1 . 

$P_{i,i}(t,\Delta t) = D(k-1,0)\, r_d^{*} r_c^{*} r_A^{*}$
for $2 \le i \le s+2$.

= “('‘-’•’I »'d*''c*'-A* '•s' ECs+l.")). 

+ D(k-T,2) (E(s+l.l)-E(s+l,0)) 


• • i. . • 






1. 


.1 .. 


for 3 < J < 5. 
for 4 £ J < 5. 

$P_{s+4,s+5}(t,\Delta t) = D(k-3,1)\, r_d^{*} r_c^{*} r_A^{*}$.
$P_{s+j,s+j}(t,\Delta t) = D(k-j+1,0)\, r_d^{*} r_c^{*} r_A^{*}$
for $3 \le j \le 5$.

_4,_. 

= ■''*D(k,0)-r^-r^-'r^-' X D(k.j). 

! o . J=T 


1.FAIL' 


(t,4t) = l-r/r^%* J_ D(k-l.j) 


for 2 < i < s+2. 


'”s+q,FAIL<*^“> “ ■'-'•d*'"c*''A* -L 

J ^ 

for 3 ^ q £ 5. 

The state nrobabiUty equations for this system are also derived 
in Appendix B. The njsultant equations are 

$P_1(t + \Delta t) = D(k,0)\, P_1(t)$. (4-7)

Pg{t + At) ■: D(k,l) + D(k-1,0) 


PgCtt At) == D(k,2) rjj'r^:"r^-r^"(l-E(2,0))‘P^{t) 
- D{k-1,0) P3(t)]. 


I 




p|:t + At) = D(k,3) r^'-r^^Yrg-'Cl-EC 

+ D(M,1) rgr Cl-ECS^O)} P3{t) + D(k-1,0^ 
Pg(t + At) = D(k,4) 0-E^^ 

+ r^%*r^* [D(k-1,3) r3Ml-E(2,E)) 


+ DCk-UZ) (1-E(3,D) PgCt) 


(4-11) 


+ D(k-l,1) r^- (1-E{4,q)} P^(t) + D(k-T,0) Pg(t)]. 


P^(t + At) = r^*r^*r^* [D(k-1,0) P.(t) 


r / ^ D(k-1.j)(l-E(i-j,j-l) P_. . 
s jUi i-J 


(4-12) 


for 6 < 1 < s+2. 


Ps+3{t + At) = r^-'r^'r^'[D(k,2)(T-r3" + ^"^(2,0) 

4 

+ I D(kJ) r ' (E(2,j-2)-E{2,a-3))] P,(t) 

J=3 ^ ‘ 

+ rd*rc*r/ f [D(k-1 .1)0-^ 

k^2 

+ D(k-U2)Tg" (E(k,l)-E(k,0;) 

+ DCk-1,3) r^r CE(k,2)-E(k,1))] P,^{t} 

+ rD{k-^,T)(1-r ■' + r/ E(s+r.O)) 

+ D(k-1.2) (E{s+1 J)--E(s+T,0))] P^^^^ 

+ D{k-ia) + D(k-2,0) (4- 
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+ D(k,4) Tg" (E(2,l) - E(2,0))] P^(t) 
s+1 

+ Td*r^*r^* J [D(k-l,2)(l-rg* + Tg" E(j,0)) 

*3 ^ 

+ b(k-l,3) Tg- {E(3M)-E{j,0))] Pj.{t) 

+ D(k-U2) Pg^2^t) + D(k~2,l) Pg^.3(t) 

+ D(k-3,0) Pj^^(t). (4-14) 

Pj+jlt + it) = D(k,4) l-d'tj'i-A'n-'-s' + I-/ Pi(E) 

s+1 

■*■ i D(k-T,3)(l-rg' + Tg- E(j,0))Pj{t) 

j“2 

+ D(k-1,3) Pg^2t‘‘^^ D(k-2.2) Pg+gCt) 

+ D(k-3,1) Pg^.^(t), (4-15) 

- ' 4 ■ 

Pp^^IlC^ + At) = 1 - D(k,0) + D(k,j) P^(t) 

S*r2 3 

{ I D(k-UJ) Pjt)) 

° k=2 m=0 ■ 

5 5**n 

■^1(1 D(k-n+l,q) P-j.„{t))] . (4-161 

n=3 q=0 

It should be noted that these equations are developed for a
double-error-correcting system with s+2 greater than 5 (equivalently,
more than 3 spare bit planes). If s+2 = i where $2 \le i \le 5$, then the
equations involving state j where $i \le j \le 5$ should be modified to delete
this state. This modification will involve only the deletion of the
appropriate equations.

If the system is defined to be operating satisfactorily in states 1
through s+5 and $P_1(t=0) = 1$, $P_i(t=0) = 0$ for $i \ne 1$, then the system
reliability may be completely specified as

\[ R_{DEC}(t) = \sum_{i=1}^{s+5} P_i(t), \tag{4-17} \]

where the $P_i$'s are obtained from equations (4-7) through (4-16).


V. ANALYSIS RESULTS 


In this chapter, typical results of analyses performed on the five
systems previously described will be discussed. Comparative reliabilities
of each system are shown and the effect of varying several system
parameters is described.

The base variable values assumed [22] for the system analyses are
shown in Table 5. For each analysis performed, the system variables
are fixed at the base value unless otherwise noted.

A comparative reliability analysis of the five subject systems was
performed by use of equations (3-7), (4-4), (4-5), (4-6), and (4-17).
The results of this analysis are displayed in Figure 7. This figure
shows the reliability of the TMR, non-spared, duplicated, basic, and
double-error-correcting systems for mission lengths of four years or
less. Also shown is the reliability of a simplex system with no error-
detecting or correcting capabilities. This system consists of 16 on-line
bit planes and has reliability $e^{-16 \lambda_{BP} t}$, where $\lambda_{BP}$ is obtained from
Table 5. It may be seen from this figure that, for missions of 1/2 year
or less, all of the systems except the non-spared and simplex systems
have reliability greater than .99. For greater mission lengths, however,
the reliability of the non-spared, duplicated, and TMR systems decreases
rapidly. For a 3-year mission, probably only the basic or double-error-
correcting systems would be acceptable.
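The simplex curve is the only one in the comparison with a closed form, so it is easy to tabulate. The sketch below rests on two assumptions that should be checked against Table 5: the bit plane failure rate is read as 2.6384 failures per 10^6 hours, and a year is taken as 8760 hours. The function and constant names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760.0          # assumed conversion, not stated in the text
BIT_PLANE_RATE = 2.6384e-6       # assumed reading of Table 5: failures per hour


def simplex_reliability(years, n_planes=16, rate_per_hour=BIT_PLANE_RATE):
    """Reliability exp(-n * lambda * t) of n_planes on-line bit planes in series."""
    hours = years * HOURS_PER_YEAR
    return math.exp(-n_planes * rate_per_hour * hours)


if __name__ == "__main__":
    for years in (0.5, 1.0, 2.0, 3.0, 4.0):
        print(f"{years:3.1f} years: {simplex_reliability(years):.4f}")
```

Under these assumptions the simplex reliability is already well below .99 at half a year, in line with the statement that the simplex system is one of the two that do not meet that level.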


TABLE 5. Base Values for System Variables

Variable                                        Base value
---------------------------------------------   ----------------
# On-line Bit Planes                            22
# Spare Bit Planes                              4
Bit Plane Failure Rate                          2.6384/10^6 hr
Detector Failure Rate                           .900/10^6 hr
Reconfiguration Switch Failure Rate             .583/10^6 hr
Corrector Failure Rate                          .027/10^6 hr
DEC Algorithm Failure Rate                      0
Mission Length                                  3 years
Memory Size                                     16K words
Failure Distribution                            Exponential
4K-Bit Subplane Failure Rate                    .5596/10^6 hr
Peripheral Bit Plane Circuitry Failure Rate     .3/10^6 hr

FIGURE 7. Reliability of Subject Systems (horizontal axis: time in years)


Comparison of the curves for the double-error-correcting and basic
systems shows the reliability improvement to be expected from the use of
the software algorithms of the double-error-correcting system. For
1/2-year missions, this improvement is negligible. For missions of
greater lengths, however, the reliability improvement gained by the use
of this system becomes important.

It is interesting to note that, while the duplicated and TMR
systems represent a doubling and tripling of memory bit planes over the
non-spared system, the basic and double-error-correcting systems result
in much higher system reliabilities with an addition of only 4 bit planes
to the non-spared system.

Figure 8 shows the results of a reliability analysis performed on
the basic system for various numbers of spare bit planes. The
corresponding curves for the double-error-correcting system are shown
in Figure 9. Comparison of these two figures shows that the same degree
of reliability achieved by the basic system with 4 spare bit planes may
be reached by a double-error-correcting system with 3 spares and a
sufficiently reliable double-error-correction algorithm. The need for
one spare bit plane may thus be alleviated by the use of software
error correction.

The reliability of the software error correction algorithms used in
the double-error-correcting system is highly important to system success.
The effects on the double-error-correcting system reliability of
varying a hypothetical failure rate for the CPU hardware which implements
these algorithms are shown in Figure 10.




FIGURE 9. Reliability of Double-Error-Correcting
System for Various Numbers of Spares (horizontal axis: time in years)
Also essential to overall system reliability is the failure rate of 
the detector. The effects of varying this failure rate are shown 
in Figure 11. 

The reliability of double-error-correcting systems with various 
memory capacities is shown in Figure 12. The major effect of memory size 
on the reliability of a system of this type is in the bit plane 
failure rate. The failure rates of other size-related components, such 
as address decoder circuits, are also affected; however, only the bit 
plane failure rates are considered in this figure. The failure 
rates used were obtained by assuming that each bit plane is composed 
of 4K-bit sub-planes and peripheral circuitry, each with a failure 
rate as shown in Table 5. 
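
The dependence of the bit plane failure rate on memory size can be illustrated with a short sketch. It is not taken from the original analysis: it assumes that each bit plane stores one bit of every word, that the sub-planes and the peripheral circuitry form a series system (so that their failure rates add), and that a year has 8760 hours. The two rate constants are the values as read from Table 5, and the function names are hypothetical.

```python
import math

# Values as read from Table 5 (failures per hour); treated here as assumptions.
SUBPLANE_RATE = 0.5596e-6      # one 4K-bit sub-plane
PERIPHERAL_RATE = 0.3e-6       # peripheral bit plane circuitry
HOURS_PER_YEAR = 8760          # assumes a 365-day year


def bit_plane_failure_rate(words):
    """Failure rate of one bit plane, assumed to hold one bit of each word."""
    subplanes = math.ceil(words / 4096)
    # Series assumption: the plane fails if any sub-plane or the periphery fails.
    return subplanes * SUBPLANE_RATE + PERIPHERAL_RATE


def bit_plane_reliability(words, years):
    """Exponential survival probability of one bit plane over the mission."""
    return math.exp(-bit_plane_failure_rate(words) * years * HOURS_PER_YEAR)


for words in (16 * 1024, 32 * 1024, 64 * 1024):
    print(words, bit_plane_failure_rate(words), bit_plane_reliability(words, 3))
```

Under these assumptions the failure rate of a single bit plane grows almost linearly with the number of words, which is in keeping with the reliability decrease reported for the larger memories.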

The results of the memory capacity analysis show that for 
missions of 1 year or less, double-error-correcting type memories 
containing up to 64K words will achieve high reliability. Greater mission 
lengths show a reliability decrease for the larger capacity memories with 
a dramatic decrease for memories larger than 32K words and a three-year 
mission length. 

The coverage of the basic system for various numbers of spare bit 
planes is shown in Figure 13. Coverage may be defined as the probability 
that the system will continue to function given that a failure occurs. 
As such, the coverage of a system is useful in analyzing the system's 
behavior after component failures of a nature not predictable by system 
failure rates. 




FIGURE 11. Reliability of Double-Error-Correcting System vs. Detector Failure Rate

FIGURE 13. Coverage for Basic System (horizontal axis: time in years)



It may be seen from this figure that a basic system with no 
spares is highly vulnerable to system component failures. As the number 
of spare bit planes increases, however, this vulnerability decreases 
rapidly until, in the system with 4 spare bit planes, there is a 
probability of .96 or greater of successful operation after a failure 
for missions of 3 years or less. 

Overall results of the analyses performed show that a high degree 
of system reliability may be obtained by a judicious combination of 
coding, modular sparing, and software error correction. Substantial 
reliability improvement over massive replication techniques is achieved 
at relatively low cost. While some sensitivity is shown to the 
reliability of system control components, fault-tolerant techniques 
applied to these components should assure high system reliability. 


VI. CONCLUSION 


A technique for the development of reliability and coverage 
equations for a class of non-repairable fault-tolerant memory systems 
has been presented. The methods discussed have been applied to several 
systems and typical results have been shown. 

The basic and double-error-correcting fault-tolerant memory systems 
have been shown to achieve high reliability at minimal cost. These 
systems make efficient use of the spare bit planes provided and the error 
correction capabilities of the code. By use of software correction, 
the double-error-correcting system adds an additional level of error 
control and may reduce the need for one of the spare bit planes. 

A major advantage of the calculation methods presented here over 
more traditional reliability calculation methods is the allowance of a 
finite Δt for state transition occurrence. The use of this finite time 
increment allows multiple system events to occur during any state 
transition. The need for separate states to represent these events is 
then diminished. The result is a state diagram with a reduced number of 
states and with probability equations that are easily computer-implemented. 

A disadvantage of this method is the lack of a closed-form solution, 
which is easily obtainable by use of other methods. Because of the 
dependency of the state probabilities at time t + Δt on the conditions 
at time t, small errors in computation at one time may cause large 
errors at succeeding times. A closed-form solution should eliminate 
this problem. 




Further work in this area could include the following: 

1. Development of a closed-form solution from the equations 
of this method, 

2. Research into the effect of un-powered spares on system 
modeling, and 

3. Application of these methods to the repairable system 
problem. 
APPENDIX A 

Flowcharts for Computational Algorithms 

Three methods for computer evaluation of the equations of this 
paper were outlined in Chapter III. Flowcharts for evaluation by use of 
Methods 2 and 3 are shown here. 

Figure 14 shows a typical implementation of evaluation Method 2. 
For this flowchart, the base computation time is selected to be 0 and 
the system starting state is state 1. TMAX is the mission length of 
interest. 

After initialization, all transition probabilities are calculated 
for the current time (T) and Δt. Where a two-state transition is 
possible, the product of the two single-state transition probabilities 
involved is formed. If this product is greater than PMAX, the maximum 
allowable two-state transition probability, Δt is reduced. 

The amount of this reduction is arbitrary. If Δt and T have units 
of hours, then a convenient method of reduction is to multiply Δt by 
0.9 and set the new Δt equal to the greatest integral number of hours 
less than this product. When this method is used, however, a test must 
be performed to ensure that Δt is not 0, since this condition would 
prevent any further processing. 

If all the two-state transition probabilities are less than PMAX, 
the state probabilities for time T + Δt are computed by substitution 
of the transition probabilities and state probabilities for time T into 
equation (3-1), the general state probability equation. If T is less 
than TMAX, T is incremented by Δt and processing continues. Otherwise, 
the system reliability is formed as a suitable sum of state probabilities, 
results are output, and processing terminates. 
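
The loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the flowchart of Figure 14: the three-state chain, its failure rate `lam`, and the names `step_matrix` and `evaluate_method2` are hypothetical. The final step is shortened so that the computation ends at TMAX, and `math.floor` stands in for "the greatest integral number of hours less than" the reduced Δt.

```python
import math


def step_matrix(dt, lam):
    """Hypothetical 3-state chain 0 -> 1 -> 2, where state 2 is failure."""
    q = 1.0 - math.exp(-lam * dt)      # single-state transition probability
    matrix = [
        [1.0 - q, q, 0.0],
        [0.0, 1.0 - q, q],
        [0.0, 0.0, 1.0],
    ]
    two_state = q * q                  # product of the 0->1 and 1->2 transitions
    return matrix, two_state


def evaluate_method2(tmax, dt, pmax, lam):
    probs = [1.0, 0.0, 0.0]            # start in the first state at time 0
    t = 0.0
    while tmax - t > 1e-9:
        step = min(dt, tmax - t)       # do not step past the mission length
        matrix, two_state = step_matrix(step, lam)
        if two_state > pmax:
            dt = math.floor(0.9 * dt)  # shrink to a whole number of hours
            if dt <= 0:
                raise ValueError("dt reached 0; PMAX cannot be met")
            continue
        # State update: P_j(T + dt) = sum over i of P_i(T) * P_ij(T, dt)
        probs = [sum(probs[i] * matrix[i][j] for i in range(3)) for j in range(3)]
        t += step
    reliability = probs[0] + probs[1]  # sum over the non-failed states
    return reliability, probs, dt


# Three-year mission in hours, with an assumed rate of 10 failures per 10^6 hr.
reliability, probs, final_dt = evaluate_method2(26280.0, 100.0, 1e-4, 1e-5)
print(reliability, final_dt)
```

For this toy chain the result can be compared with the continuous-time value exp(-λt)(1 + λt), which gives a rough indication of the error introduced by neglecting two-state transitions at a given PMAX.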

Figure 15 shows an implementation of a Method 3 evaluation. This 
flowchart follows the steps outlined in the second computational 
algorithm of Chapter III. 

It should be noted from equation (3-11) that if the base computation 
time is 0 and the system starting state is state i, so that P_i(0) = 1, 
then the multi-step transition matrix of that equation contains the 
probability of state j in its (i,j) location. For this case, then, the 
multiplication by P(t) to obtain P(t + nΔt) is unnecessary, since the 
state probabilities may be determined directly. 
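
The observation above can be checked numerically with a small sketch. It assumes that the n-step matrix is the n-th power of a single-step transition matrix with a fixed Δt and time-independent transition probabilities; equation (3-11) itself is not reproduced here, and the 3-state matrix and the helper names are hypothetical.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = m
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


def propagate(p, m):
    """Row vector times matrix: P(t + n*dt) = P(t) * M."""
    return [sum(p[i] * m[i][j] for i in range(len(p))) for j in range(len(m))]


# Hypothetical one-step matrix for a 3-state chain (the last state is failure).
q = 0.001
T = [[1 - q, q, 0.0], [0.0, 1 - q, q], [0.0, 0.0, 1.0]]
Tn = mat_pow(T, 262)

start = [1.0, 0.0, 0.0]             # the system starts in the first state
via_product = propagate(start, Tn)  # explicit multiplication by the start vector
direct = Tn[0]                      # row of the starting state, read directly
print(via_product, direct)
```

Because the starting vector has a single non-zero entry, the product simply selects one row of the matrix, so the explicit multiplication can be skipped.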




FIGURE 15. Flowchart for Reliability Computations by Method 3





APPENDIX B 

Development of Equations for the Double-Error-Correcting System 

A listing of transition events and the subevents causing the transitions 
is shown in Table 6. In this table, the success of the detector, 
corrector, correction algorithm and switch prior to time t + Δt is 
represented by D', C', A', and W', respectively. Success in the time 
interval from t to t + Δt is denoted by a superscript asterisk, as in 
D*, C*, and A*. The non-success event is denoted by a subtraction of the 
appropriate symbol from 1. 

For the derivation of the transition and state probability equations, 
the following notation will be used. 

D(x,y) = P(y correctable on-line bit plane errors out of x 
possible on-line bit planes given all were good 
at time t) 

E(x,y) = P(y or fewer good spare bit planes out of s - x + 2 
available) 

k=0 " 

’’m* ■ 
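
As an aid to reading the equations that use D(x,y) and E(x,y), the two quantities can be sketched as binomial probabilities. This is an assumption, not the formulas of the original text: bit planes are taken to fail independently with a common probability `q` over the interval, each spare is taken to be good with a common probability `p_spare`, and the "correctable" qualifier on D is ignored. The parameter names are hypothetical.

```python
import math


def D(x, y, q):
    """P(exactly y of x on-line bit planes fail in the interval), binomial form.

    q is the assumed per-plane failure probability over the interval.
    """
    if y < 0 or y > x:
        return 0.0
    return math.comb(x, y) * q**y * (1.0 - q) ** (x - y)


def E(x, y, s, p_spare):
    """P(y or fewer good spares out of the s - x + 2 still available).

    p_spare is the assumed probability that a single spare is good.
    """
    n = s - x + 2
    return sum(math.comb(n, j) * p_spare**j * (1.0 - p_spare) ** (n - j)
               for j in range(min(y, n) + 1))


s = 4
print(1 - E(2, 0, s, 0.9))   # at least one good spare, all s spares available
print(E(s, 2, s, 0.9))       # only two spares remain, so "2 or fewer" is certain
```

With these forms, 1 - E(x,0) is the probability of at least one good spare and E(x,1) - E(x,0) is the probability of exactly one good spare, which is how the two quantities enter the transition probability equations.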

TABLE 6. Events and Subevents for Double-Error-Correcting System

                                  SUBEVENTS CAUSING TRANSITION
                       # on-line correctable
                       B-P errors /
Transition             # possible bits    Other subevents

1,2                    1/K                D'C'A'
1,3                    2/K                D'C'A'W' (at least 1 good spare)
1,4                    3/K                D'C'A'W' (at least 2 good spares)
1,5                    4/K                D'C'A'W' (at least 3 good spares)
1,s+3                  2/K                D'C'A'((1-W') or W' (no good spares))
                    or 3/K                D'C'A'W' (exactly 1 good spare)
                    or 4/K                D'C'A'W' (exactly 2 good spares)
1,s+4                  3/K                D'C'A'((1-W') or W' (no good spares))
                    or 4/K                D'C'A'W' (exactly 1 good spare)
1,s+5                  4/K                D'C'A'((1-W') or W' (no good spares))
1,1                    0/K
i,i+1                  1/K-1              D*C*A*W' (at least 1 good spare)
  2 ≤ i ≤ s+1
i,i+2                  2/K-1              D*C*A*W' (at least 2 good spares)
  2 ≤ i ≤ s
i,i+3                  3/K-1              D*C*A*W' (at least 3 good spares)
  2 ≤ i ≤ s-1
i,s+3                  1/K-1              D*C*A*((1-W') or W' (no good spares))
  2 ≤ i ≤ s         or 2/K-1              D*C*A*W' (exactly 1 good spare)
                    or 3/K-1              D*C*A*W' (exactly 2 good spares)
i,s+4                  2/K-1              D*C*A*((1-W') or W' (no good spares))
  2 ≤ i ≤ s+1       or 3/K-1              D*C*A*W' (exactly 1 good spare)
i,s+5                  3/K-1              D*C*A*((1-W') or W' (no good spares))
  2 ≤ i ≤ s+1
i,i                    0/K-1              D*C*A*
  2 ≤ i ≤ s+2
s+1,s+3                1/K-1              D*C*A*((1-W') or W' (no good spares))
                    or 2/K-1              D*C*A*W' (exactly 1 good spare)
s+2,s+j                j-2/K-1            D*C*A*
  3 ≤ j ≤ 5
s+3,s+j                j-3/K-2            D*C*A*
  4 ≤ j ≤ 5
s+4,s+5                1/K-3              D*C*A*
s+j,s+j                0/K-j+1            D*C*A*
  3 ≤ j ≤ 5


The double-error-correcting system transition probability equations 
may now be specified as: 

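\[
\begin{aligned}
P_{1,2}(t,\Delta t) &= D(k,1)\, r_d' r_c' r_A' \\
P_{1,3}(t,\Delta t) &= D(k,2)\, r_d' r_c' r_A' r_s' \,(1-E(2,0)) \\
P_{1,4}(t,\Delta t) &= D(k,3)\, r_d' r_c' r_A' r_s' \,(1-E(2,1)) \\
P_{1,5}(t,\Delta t) &= D(k,4)\, r_d' r_c' r_A' r_s' \,(1-E(2,2)) \\
P_{1,s+3}(t,\Delta t) &= D(k,2)\, r_d' r_c' r_A' \,(1 - r_s' + r_s' E(2,0)) \\
&\quad + D(k,3)\, r_d' r_c' r_A' r_s' \,(E(2,1)-E(2,0)) \\
&\quad + D(k,4)\, r_d' r_c' r_A' r_s' \,(E(2,2)-E(2,1)) \\
P_{1,s+4}(t,\Delta t) &= D(k,3)\, r_d' r_c' r_A' \,(1 - r_s' + r_s' E(2,0)) \\
&\quad + D(k,4)\, r_d' r_c' r_A' r_s' \,(E(2,1)-E(2,0)) \\
P_{1,s+5}(t,\Delta t) &= D(k,4)\, r_d' r_c' r_A' \,(1 - r_s' + r_s' E(2,0)) \\
P_{1,1}(t,\Delta t) &= D(k,0)
\end{aligned}
\]

\[
\begin{aligned}
P_{i,i+1}(t,\Delta t) &= D(k-1,1)\, r_d^* r_c^* r_A^* r_s' \,(1-E(i,0)), && 2 \le i \le s+1 \\
P_{i,i+2}(t,\Delta t) &= D(k-1,2)\, r_d^* r_c^* r_A^* r_s' \,(1-E(i,1)), && 2 \le i \le s \\
P_{i,i+3}(t,\Delta t) &= D(k-1,3)\, r_d^* r_c^* r_A^* r_s' \,(1-E(i,2)), && 2 \le i \le s-1 \\
P_{i,s+3}(t,\Delta t) &= D(k-1,1)\, r_d^* r_c^* r_A^* \,(1 - r_s' + r_s' E(i,0)) \\
&\quad + D(k-1,2)\, r_d^* r_c^* r_A^* r_s' \,(E(i,1)-E(i,0)) \\
&\quad + D(k-1,3)\, r_d^* r_c^* r_A^* r_s' \,(E(i,2)-E(i,1)), && 2 \le i \le s \\
P_{i,s+4}(t,\Delta t) &= D(k-1,2)\, r_d^* r_c^* r_A^* \,(1 - r_s' + r_s' E(i,0)) \\
&\quad + D(k-1,3)\, r_d^* r_c^* r_A^* r_s' \,(E(i,1)-E(i,0)), && 2 \le i \le s+1 \\
P_{i,s+5}(t,\Delta t) &= D(k-1,3)\, r_d^* r_c^* r_A^* \,(1 - r_s' + r_s' E(i,0)), && 2 \le i \le s+1 \\
P_{i,i}(t,\Delta t) &= D(k-1,0)\, r_d^* r_c^* r_A^*, && 2 \le i \le s+2
\end{aligned}
\]

\[
\begin{aligned}
P_{s+1,s+3}(t,\Delta t) &= D(k-1,1)\, r_d^* r_c^* r_A^* \,(1 - r_s' + r_s' E(s+1,0)) \\
&\quad + D(k-1,2)\, r_d^* r_c^* r_A^* r_s' \,(E(s+1,1)-E(s+1,0)) \\
P_{s+2,s+j}(t,\Delta t) &= D(k-1,j-2)\, r_d^* r_c^* r_A^*, && 3 \le j \le 5 \\
P_{s+3,s+j}(t,\Delta t) &= D(k-2,j-3)\, r_d^* r_c^* r_A^*, && 4 \le j \le 5 \\
P_{s+4,s+5}(t,\Delta t) &= D(k-3,1)\, r_d^* r_c^* r_A^* \\
P_{s+j,s+j}(t,\Delta t) &= D(k-j+1,0)\, r_d^* r_c^* r_A^*, && 3 \le j \le 5
\end{aligned}
\]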

Since 

\[
P_{i,FAIL}(t,\Delta t) = 1 - \sum_{j=1}^{s+5} P_{i,j}(t,\Delta t),
\]

the following equations may be developed: 

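\[
\begin{aligned}
P_{1,FAIL}(t,\Delta t) &= 1 - P_{1,1}(t,\Delta t) - P_{1,2}(t,\Delta t) - P_{1,3}(t,\Delta t) - P_{1,4}(t,\Delta t) - P_{1,5}(t,\Delta t) \\
&\quad - P_{1,s+3}(t,\Delta t) - P_{1,s+4}(t,\Delta t) - P_{1,s+5}(t,\Delta t) \\
&= 1 - D(k,0) - D(k,1)\, r_d' r_c' r_A' \\
&\quad - D(k,2)\, r_d' r_c' r_A' \,[\, r_s'(1-E(2,0)) + 1 - r_s' + r_s' E(2,0) \,] \\
&\quad - D(k,3)\, r_d' r_c' r_A' \,[\, r_s'(1-E(2,1)) + r_s'(E(2,1)-E(2,0)) + 1 - r_s' + r_s' E(2,0) \,] \\
&\quad - D(k,4)\, r_d' r_c' r_A' \,[\, r_s'(1-E(2,2)) + r_s'(E(2,2)-E(2,1)) \\
&\qquad\qquad + r_s'(E(2,1)-E(2,0)) + 1 - r_s' + r_s' E(2,0) \,] \\
&= 1 - D(k,0) - D(k,1)\, r_d' r_c' r_A' - D(k,2)\, r_d' r_c' r_A' - D(k,3)\, r_d' r_c' r_A' - D(k,4)\, r_d' r_c' r_A' \\
&= 1 - D(k,0) - r_d' r_c' r_A' \sum_{j=1}^{4} D(k,j).
\end{aligned}
\]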

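For 2 ≤ i ≤ s-1, 

\[
\begin{aligned}
P_{i,FAIL}(t,\Delta t) &= 1 - P_{i,i}(t,\Delta t) - P_{i,i+1}(t,\Delta t) - P_{i,i+2}(t,\Delta t) - P_{i,i+3}(t,\Delta t) \\
&\quad - P_{i,s+3}(t,\Delta t) - P_{i,s+4}(t,\Delta t) - P_{i,s+5}(t,\Delta t) \\
&= 1 - D(k-1,0)\, r_d^* r_c^* r_A^* \\
&\quad - D(k-1,1)\, r_d^* r_c^* r_A^* \,[\, r_s'(1-E(i,0)) + 1 - r_s' + r_s' E(i,0) \,] \\
&\quad - D(k-1,2)\, r_d^* r_c^* r_A^* \,[\, r_s'(1-E(i,1)) + r_s'(E(i,1)-E(i,0)) + 1 - r_s' + r_s' E(i,0) \,] \\
&\quad - D(k-1,3)\, r_d^* r_c^* r_A^* \,[\, r_s'(1-E(i,2)) + r_s'(E(i,2)-E(i,1)) \\
&\qquad\qquad + r_s'(E(i,1)-E(i,0)) + 1 - r_s' + r_s' E(i,0) \,] \\
&= 1 - r_d^* r_c^* r_A^* \,[\, D(k-1,0) + D(k-1,1) + D(k-1,2) + D(k-1,3) \,].
\end{aligned}
\]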

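\[
\begin{aligned}
P_{s,FAIL}(t,\Delta t) &= 1 - P_{s,s}(t,\Delta t) - P_{s,s+1}(t,\Delta t) - P_{s,s+2}(t,\Delta t) \\
&\quad - P_{s,s+3}(t,\Delta t) - P_{s,s+4}(t,\Delta t) - P_{s,s+5}(t,\Delta t) \\
&= 1 - r_d^* r_c^* r_A^* \,\{\, D(k-1,0) \\
&\quad + D(k-1,1)\,[\, r_s'(1-E(s,0)) + 1 - r_s' + r_s' E(s,0) \,] \\
&\quad + D(k-1,2)\,[\, r_s'(1-E(s,1)) + r_s'(E(s,1)-E(s,0)) + 1 - r_s' + r_s' E(s,0) \,] \\
&\quad + D(k-1,3)\,[\, r_s'(E(s,2)-E(s,1)) + r_s'(E(s,1)-E(s,0)) + 1 - r_s' + r_s' E(s,0) \,] \,\} \\
&= 1 - r_d^* r_c^* r_A^* \,\{\, D(k-1,0) + D(k-1,1) + D(k-1,2) + D(k-1,3)\,[\, r_s' E(s,2) + 1 - r_s' \,] \,\}.
\end{aligned}
\]

But E(s,2) = 1, since only s - s + 2 = 2 spare bit planes are available 
in state s, so 

\[
P_{s,FAIL}(t,\Delta t) = 1 - r_d^* r_c^* r_A^* \,[\, D(k-1,0) + D(k-1,1) + D(k-1,2) + D(k-1,3) \,].
\]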

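\[
\begin{aligned}
P_{s+1,FAIL}(t,\Delta t) &= 1 - P_{s+1,s+1}(t,\Delta t) - P_{s+1,s+2}(t,\Delta t) - P_{s+1,s+3}(t,\Delta t) \\
&\quad - P_{s+1,s+4}(t,\Delta t) - P_{s+1,s+5}(t,\Delta t) \\
&= 1 - r_d^* r_c^* r_A^* \,\{\, D(k-1,0) \\
&\quad + D(k-1,1)\,[\, r_s'(1-E(s+1,0)) + 1 - r_s' + r_s' E(s+1,0) \,] \\
&\quad + D(k-1,2)\,[\, 1 - r_s' + r_s' E(s+1,0) + r_s'(E(s+1,1)-E(s+1,0)) \,] \\
&\quad + D(k-1,3)\,[\, r_s'(E(s+1,1)-E(s+1,0)) + 1 - r_s' + r_s' E(s+1,0) \,] \,\} \\
&= 1 - r_d^* r_c^* r_A^* \,\{\, D(k-1,0) + D(k-1,1) + D(k-1,2)\,[\, 1 - r_s' + r_s' E(s+1,1) \,] \\
&\quad + D(k-1,3)\,[\, r_s' E(s+1,1) + 1 - r_s' \,] \,\}.
\end{aligned}
\]

But E(s+1,1) = 1, since only one spare bit plane is available in state 
s+1, so 

\[
P_{s+1,FAIL}(t,\Delta t) = 1 - r_d^* r_c^* r_A^* \,[\, D(k-1,0) + D(k-1,1) + D(k-1,2) + D(k-1,3) \,].
\]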

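\[
\begin{aligned}
P_{s+2,FAIL}(t,\Delta t) &= 1 - P_{s+2,s+2}(t,\Delta t) - P_{s+2,s+3}(t,\Delta t) - P_{s+2,s+4}(t,\Delta t) - P_{s+2,s+5}(t,\Delta t) \\
&= 1 - r_d^* r_c^* r_A^* \,[\, D(k-1,0) + D(k-1,1) + D(k-1,2) + D(k-1,3) \,].
\end{aligned}
\]

Similarly, 

\[
\begin{aligned}
P_{s+3,FAIL}(t,\Delta t) &= 1 - r_d^* r_c^* r_A^* \,[\, D(k-2,0) + D(k-2,1) + D(k-2,2) \,] \\
P_{s+4,FAIL}(t,\Delta t) &= 1 - r_d^* r_c^* r_A^* \,[\, D(k-3,0) + D(k-3,1) \,] \\
P_{s+5,FAIL}(t,\Delta t) &= 1 - P_{s+5,s+5}(t,\Delta t) = 1 - r_d^* r_c^* r_A^* \, D(k-4,0).
\end{aligned}
\]

So 

\[
\begin{aligned}
P_{1,FAIL}(t,\Delta t) &= 1 - D(k,0) - r_d' r_c' r_A' \sum_{j=1}^{4} D(k,j) \\
P_{i,FAIL}(t,\Delta t) &= 1 - r_d^* r_c^* r_A^* \sum_{q=0}^{3} D(k-1,q), && 2 \le i \le s+2 \\
P_{s+j,FAIL}(t,\Delta t) &= 1 - r_d^* r_c^* r_A^* \sum_{q=0}^{5-j} D(k-j+1,q), && 3 \le j \le 5.
\end{aligned}
\]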
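
The simplification of the state 1 failure probability can be checked numerically. The sketch below is an illustration under assumptions, not part of the original derivation: D and E are given simple binomial forms, `r_dca` stands for the product r_d' r_c' r_A', and all numerical values are hypothetical.

```python
import math


def D(x, y, q):
    """Assumed binomial form: exactly y of x on-line bit planes fail."""
    return math.comb(x, y) * q**y * (1 - q) ** (x - y) if 0 <= y <= x else 0.0


def E(x, y, s, p):
    """Assumed binomial form: y or fewer good spares out of s - x + 2."""
    n = s - x + 2
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j)
               for j in range(min(y, n) + 1))


def state1_row(k, s, q, p, r_dca, r_s):
    """Transition probabilities out of state 1, following the equations above."""
    e0, e1, e2 = (E(2, y, s, p) for y in range(3))
    no_switch = 1 - r_s + r_s * e0   # switch failed, or no good spare
    return {
        "1,1": D(k, 0, q),
        "1,2": D(k, 1, q) * r_dca,
        "1,3": D(k, 2, q) * r_dca * r_s * (1 - e0),
        "1,4": D(k, 3, q) * r_dca * r_s * (1 - e1),
        "1,5": D(k, 4, q) * r_dca * r_s * (1 - e2),
        "1,s+3": r_dca * (D(k, 2, q) * no_switch
                          + D(k, 3, q) * r_s * (e1 - e0)
                          + D(k, 4, q) * r_s * (e2 - e1)),
        "1,s+4": r_dca * (D(k, 3, q) * no_switch + D(k, 4, q) * r_s * (e1 - e0)),
        "1,s+5": r_dca * D(k, 4, q) * no_switch,
    }


row = state1_row(k=20, s=4, q=0.01, p=0.95, r_dca=0.999, r_s=0.998)
fail_from_sum = 1 - sum(row.values())
fail_closed = 1 - D(20, 0, 0.01) - 0.999 * sum(D(20, j, 0.01) for j in range(1, 5))
print(fail_from_sum, fail_closed)
```

The two values agree to rounding error because, for each number of bit plane errors, the switch and spare subevents are exhaustive, so the bracketed terms in the derivation sum to 1.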


By substitution of the transition probability equations into the 
general state probability equation, the state probability equations 
for the double-error-correcting system are obtained as follows: 


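\[
\begin{aligned}
P_1(t+\Delta t) &= P_{1,1}(t,\Delta t)\,P_1(t) \\
&= D(k,0)\,P_1(t).
\end{aligned}
\]

\[
\begin{aligned}
P_2(t+\Delta t) &= P_{1,2}(t,\Delta t)\,P_1(t) + P_{2,2}(t,\Delta t)\,P_2(t) \\
&= D(k,1)\, r_d' r_c' r_A' \,P_1(t) + D(k-1,0)\, r_d^* r_c^* r_A^* \,P_2(t).
\end{aligned}
\]

\[
\begin{aligned}
P_3(t+\Delta t) &= P_{1,3}(t,\Delta t)\,P_1(t) + P_{2,3}(t,\Delta t)\,P_2(t) + P_{3,3}(t,\Delta t)\,P_3(t) \\
&= D(k,2)\, r_d' r_c' r_A' r_s' \,(1-E(2,0))\,P_1(t) \\
&\quad + r_d^* r_c^* r_A^* \,[\, D(k-1,1)\, r_s' (1-E(2,0))\,P_2(t) + D(k-1,0)\,P_3(t) \,].
\end{aligned}
\]

\[
\begin{aligned}
P_4(t+\Delta t) &= P_{1,4}(t,\Delta t)\,P_1(t) + P_{2,4}(t,\Delta t)\,P_2(t) + P_{3,4}(t,\Delta t)\,P_3(t) + P_{4,4}(t,\Delta t)\,P_4(t) \\
&= D(k,3)\, r_d' r_c' r_A' r_s' \,(1-E(2,1))\,P_1(t) \\
&\quad + r_d^* r_c^* r_A^* \,[\, D(k-1,2)\, r_s' (1-E(2,1))\,P_2(t) \\
&\qquad + D(k-1,1)\, r_s' (1-E(3,0))\,P_3(t) + D(k-1,0)\,P_4(t) \,].
\end{aligned}
\]

\[
\begin{aligned}
P_5(t+\Delta t) &= P_{1,5}(t,\Delta t)\,P_1(t) + P_{2,5}(t,\Delta t)\,P_2(t) + P_{3,5}(t,\Delta t)\,P_3(t) \\
&\quad + P_{4,5}(t,\Delta t)\,P_4(t) + P_{5,5}(t,\Delta t)\,P_5(t) \\
&= D(k,4)\, r_d' r_c' r_A' r_s' \,(1-E(2,2))\,P_1(t) \\
&\quad + r_d^* r_c^* r_A^* \,[\, D(k-1,3)\, r_s' (1-E(2,2))\,P_2(t) + D(k-1,2)\, r_s' (1-E(3,1))\,P_3(t) \\
&\qquad + D(k-1,1)\, r_s' (1-E(4,0))\,P_4(t) + D(k-1,0)\,P_5(t) \,].
\end{aligned}
\]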

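\[
\begin{aligned}
P_i(t+\Delta t) &= P_{i-3,i}(t,\Delta t)\,P_{i-3}(t) + P_{i-2,i}(t,\Delta t)\,P_{i-2}(t) \\
&\quad + P_{i-1,i}(t,\Delta t)\,P_{i-1}(t) + P_{i,i}(t,\Delta t)\,P_i(t) \\
&= r_d^* r_c^* r_A^* \,[\, D(k-1,3)\, r_s' (1-E(i-3,2))\,P_{i-3}(t) \\
&\qquad + D(k-1,2)\, r_s' (1-E(i-2,1))\,P_{i-2}(t) \\
&\qquad + D(k-1,1)\, r_s' (1-E(i-1,0))\,P_{i-1}(t) + D(k-1,0)\,P_i(t) \,] \\
&= r_d^* r_c^* r_A^* \,\Big[\, D(k-1,0)\,P_i(t) + r_s' \sum_{j=1}^{3} D(k-1,j)\,(1-E(i-j,j-1))\,P_{i-j}(t) \,\Big]
\end{aligned}
\]

for 6 ≤ i ≤ s+2. 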


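\[
\begin{aligned}
P_{s+3}(t+\Delta t) &= \sum_{j=1}^{s+3} P_{j,s+3}(t,\Delta t)\,P_j(t) \\
&= r_d' r_c' r_A' \,[\, D(k,2)\,(1 - r_s' + r_s' E(2,0)) + D(k,3)\, r_s' (E(2,1)-E(2,0)) \\
&\qquad + D(k,4)\, r_s' (E(2,2)-E(2,1)) \,]\,P_1(t) \\
&\quad + r_d^* r_c^* r_A^* \,\Big\{ \sum_{j=2}^{s} [\, D(k-1,1)\,(1 - r_s' + r_s' E(j,0)) \\
&\qquad + D(k-1,2)\, r_s' (E(j,1)-E(j,0)) \\
&\qquad + D(k-1,3)\, r_s' (E(j,2)-E(j,1)) \,]\,P_j(t) \\
&\quad + [\, D(k-1,1)\,(1 - r_s' + r_s' E(s+1,0)) \\
&\qquad + D(k-1,2)\, r_s' (E(s+1,1)-E(s+1,0)) \,]\,P_{s+1}(t) \\
&\quad + D(k-1,1)\,P_{s+2}(t) + D(k-2,0)\,P_{s+3}(t) \Big\}.
\end{aligned}
\]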

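\[
\begin{aligned}
P_{s+4}(t+\Delta t) &= \sum_{j=1}^{s+4} P_{j,s+4}(t,\Delta t)\,P_j(t) \\
&= r_d' r_c' r_A' \,[\, D(k,3)\,(1 - r_s' + r_s' E(2,0)) + D(k,4)\, r_s' (E(2,1)-E(2,0)) \,]\,P_1(t) \\
&\quad + r_d^* r_c^* r_A^* \,\Big\{ \sum_{j=2}^{s+1} [\, D(k-1,2)\,(1 - r_s' + r_s' E(j,0)) \\
&\qquad + D(k-1,3)\, r_s' (E(j,1)-E(j,0)) \,]\,P_j(t) \\
&\quad + D(k-1,2)\,P_{s+2}(t) + D(k-2,1)\,P_{s+3}(t) + D(k-3,0)\,P_{s+4}(t) \Big\}.
\end{aligned}
\]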
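\[
\begin{aligned}
P_{s+5}(t+\Delta t) &= \sum_{j=1}^{s+5} P_{j,s+5}(t,\Delta t)\,P_j(t) \\
&= D(k,4)\, r_d' r_c' r_A' \,(1 - r_s' + r_s' E(2,0))\,P_1(t) \\
&\quad + r_d^* r_c^* r_A^* \,\Big\{ \sum_{j=2}^{s+1} D(k-1,3)\,(1 - r_s' + r_s' E(j,0))\,P_j(t) \\
&\quad + D(k-1,3)\,P_{s+2}(t) + D(k-2,2)\,P_{s+3}(t) \\
&\quad + D(k-3,1)\,P_{s+4}(t) + D(k-4,0)\,P_{s+5}(t) \Big\}.
\end{aligned}
\]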

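\[
\begin{aligned}
P_{FAIL}(t+\Delta t) &= \sum_{j=1}^{s+5} P_{j,FAIL}(t,\Delta t)\,P_j(t) + P_{FAIL}(t) \\
&= \Big[\, 1 - D(k,0) - r_d' r_c' r_A' \sum_{j=1}^{4} D(k,j) \,\Big] P_1(t) \\
&\quad + \sum_{q=2}^{s+2} \Big[\, 1 - r_d^* r_c^* r_A^* \sum_{m=0}^{3} D(k-1,m) \,\Big] P_q(t) \\
&\quad + \sum_{n=3}^{5} \Big[\, 1 - r_d^* r_c^* r_A^* \sum_{q=0}^{5-n} D(k-n+1,q) \,\Big] P_{s+n}(t) + P_{FAIL}(t) \\
&= \sum_{r=1}^{s+5} P_r(t) + P_{FAIL}(t) - \Big[\, D(k,0) + r_d' r_c' r_A' \sum_{j=1}^{4} D(k,j) \,\Big] P_1(t) \\
&\quad - r_d^* r_c^* r_A^* \,\Big[\, \sum_{q=2}^{s+2} \sum_{m=0}^{3} D(k-1,m)\,P_q(t) \\
&\qquad + \sum_{n=3}^{5} \sum_{q=0}^{5-n} D(k-n+1,q)\,P_{s+n}(t) \,\Big].
\end{aligned}
\]

But 

\[
\sum_{r=1}^{s+5} P_r(t) + P_{FAIL}(t) = 1,
\]

so 

\[
\begin{aligned}
P_{FAIL}(t+\Delta t) &= 1 - \Big[\, D(k,0) + r_d' r_c' r_A' \sum_{j=1}^{4} D(k,j) \,\Big] P_1(t) \\
&\quad - r_d^* r_c^* r_A^* \,\Big[\, \sum_{q=2}^{s+2} \sum_{m=0}^{3} D(k-1,m)\,P_q(t) \\
&\qquad + \sum_{n=3}^{5} \Big( \sum_{q=0}^{5-n} D(k-n+1,q) \Big) P_{s+n}(t) \,\Big].
\end{aligned}
\]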


